Comments (9)

andre-st commented on May 29, 2024

Hi San Kumar, thanks for sharing your script. I will definitely check this out over the course of the next week.

from goodreads-toolbox.

san-kumar commented on May 29, 2024
#!/usr/bin/env perl

#<--------------------------------- MAN PAGE --------------------------------->|

=pod

=head1 NAME

bookfinder - finding books by looking at bookshelves of people who read similar books

=head1 PURPOSE

=over

=item * fetches books with 4 and 5 stars in your profile

=item * crawls reviews of these books to find users who also rated it 4 or 5 stars

=item * looks up the bookshelves of those users to see which books they rated 4 or 5 stars

=item * ranks books based on number of votes from these users

=item * also ranks users by number of books they have in common (min 2, as hard-coded in the script)

=item * also gives more weight to users who both love and hate the same books as you

=back

=head1 SYNOPSIS

B<bookfinder.pl>
[B<-n> F<number>]
[B<-x> F<number>]
[B<-d> F<filename>] 
[B<-u> F<number>] 
[B<-c> F<numdays>] 
[B<-o> F<filename>] 
[B<-s> F<shelfname> ...] 
[B<-i>]
F<goodloginmail> [F<goodloginpass>]


=head1 OPTIONS

Mandatory arguments to long options are mandatory for short options too.

=over 4

=item B<-n, --maxbooks>=F<number>

Max number of books in a user's bookshelf. Currently set to
500. People who have hundreds or thousands of books often
add more noise than signal to your results.


=item B<-x, --rigor>=F<numlevel>

we need to find members who rated the books you like,
though Goodreads shows only a few ratings per book.
We exploit ratings filters and the reviews-search to find more members:

 level 1 = filters-based search of book-raters (max 5400 ratings) - default
 level 2 = like 1 plus dict-search if >3000 ratings with stall-time of 2min
 level n = like 1 plus dict-search with stall-time of n minutes

Rigor level 0 is useless here (latest readers only), 
and 2+ (dict-search) has a bad cost/benefit ratio given hundreds of books.


=item B<-d, --dict>=F<filename>

default is F<./list-in/dict.lst>


=item B<-u, --userid>=F<number>

check another member instead of the one identified by the login-mail 
and password arguments. You find the ID by looking at the shelf URLs.


=item B<-c, --cache>=F<numdays>

number of days to store and reuse downloaded data in F</tmp/FileCache/>,
default is 31 days. This helps with cheap recovery on a crash, power blackout 
or pause, and when experimenting with parameters. Loading data from Goodreads
is a very time consuming process.


=item B<-o, --outfile>=F<filename>

name of the CSV file where we write the ranked books to, default is
F<./list-out/bookfinder-books.csv>


=item B<-i, --ignore-errors>

Don't retry on errors, just keep going. 
Sometimes useful if a single Goodreads resource hangs over long periods 
and you're okay with some values missing in your result.
This option is not recommended when you run the program unattended.




=item B<-?, --help>

show full man page

=back


=head1 FILES

F<./list-in/dict.lst>

F<./list-out/bookfinder-users.csv>

F<./list-out/bookfinder-books.csv>

F</tmp/FileCache/>


=head1 EXAMPLES

$ ./bookfinder.pl [email protected] MyPASSword

$ ./bookfinder.pl -c 31 -o myfile.csv  [email protected] pass


=head1 REPORTING BUGS

Report bugs to <[email protected]> or use Github's issue tracker
L<https://github.com/andre-st/goodreads-toolbox/issues>


=head1 COPYRIGHT

This is free software. You may redistribute copies of it under the terms of
the GNU General Public License L<https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.



=head1 VERSION

2020-01-23 (Since 2018-06-22)

=cut

#<--------------------------------- 79 chars --------------------------------->|


use strict;
use warnings qw(all);
use locale;
use 5.18.0;

# Perl core:
use FindBin;
use lib "$FindBin::Bin/lib/";
use Time::HiRes qw(time tv_interval);
use POSIX qw(strftime floor locale_h);
use File::Spec; # Platform indep. directory separator
use IO::File;
use Getopt::Long;
use Pod::Usage;
# Third party:
use Text::CSV;
# Ours:
use Goodscrapes;


# ----------------------------------------------------------------------------
# Program configuration:
#
setlocale(LC_CTYPE, "en_US"); # GR dates all en_US
STDOUT->autoflush(1);
gsetopt(cache_days => 31);

our $TSTART = time();
our $MINCOMMON = 5;
our $MAXAUBOOKS = 100;
our $RIGOR = 1;
our $MAXBOOKS = 500;
our $DICTPATH = File::Spec->catfile($FindBin::Bin, 'list-in', 'dict.lst');
our $OUTPATH;
our @SHELVES;
our $USERID;

GetOptions('rigor|x=i'          => \$RIGOR,
    'dict|d=s'           => \$DICTPATH,
    'userid|u=s'         => \$USERID,
    'outfile|o=s'        => \$OUTPATH,
    'maxbooks|n=i'       => \$MAXBOOKS,
    'shelf|s=s'          => \@SHELVES,
    'ignore-errors|i'    => sub {gsetopt(ignore_errors => 1);},
    'cache|c=i'          => sub {gsetopt(cache_days => $_[1]);},
    'help|?'             => sub {pod2usage(-verbose => 2);})
    or pod2usage(1);

pod2usage(1) if !$ARGV[0];

glogin(usermail => $ARGV[0], # Login also allows to load 200 books in 1 request
    userpass    => $ARGV[1], # Asks pw if omitted
    r_userid    => \$USERID);

sub bookshelf {
    my $id = shift;
    my %books;

    print "\nLooking up bookshelf of $id..";

    greadshelf(from_user_id => $id,
        ra_from_shelves     => [ 'read' ],
        rh_into             => \%books,
        # on_book       => sub{},
        on_progress         => gmeter('books')
    );

    my (@good, @bad);
    for my $book_id (keys %books) {
        my $book = $books{$book_id};
        #next unless $book->{title} =~ /Club/;

        my $rating = $book->{user_rating} // 0;    # 0 = not rated
        push(@good, $book) if ($rating >= 4);
        push(@bad, $book) if ($rating >= 1 && $rating <= 2);    # skip unrated books

        #warn("cannot find rating for $book->{title} of $id\n") unless ($rating >= 1);
    }

    return (\@good, \@bad);
}

sub bookgenres {
    my $bid = shift;
    my $html = Goodscrapes::_html(Goodscrapes::_book_url($bid));
    my @genres;
    while ($html =~ m[href="/genres/([\w-]+)"]g) {
        push(@genres, $1);
    }

    return \@genres;
}

my ($su_good, $su_bad) = bookshelf($USERID);
my (%good_users, %good_books, %haters);

for my $book (@$su_good) {
    print "\nLooking up reviews for $book->{title}..";
    $book->{reviews} = {};
    greadreviews(rh_for_book => $book,
        rh_into              => $book->{reviews},
        rigor                => $RIGOR,
        dict_path            => $DICTPATH,
        on_progress          => gmeter('memb'));

    for my $rev (values %{$book->{reviews}}) {
        my $u = $rev->{rh_user};
        # One vote per liked book that this member rated high (or low)
        if ($rev->{rating} >= 4) {
            $good_users{$u->{id}}{votes}++;
            $good_users{$u->{id}}{user} = $u;
        } elsif ($rev->{rating} <= 2) {
            $haters{$u->{id}}{votes}++;
            $haters{$u->{id}}{user} = $u;
        }
    }
}

for my $u (keys %good_users) {
    $good_users{$u}->{'bad'} = defined($haters{$u}->{votes}) ? $haters{$u}->{votes} : 0;
}

printf("\nHere are your best users (out of %d users):\n", scalar keys %good_users);
my $filename = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-users.csv");
my $csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open my $fh, ">:encoding(utf8)", $filename or die "failed to create $filename: $!";

$csv->print($fh, [ 'uid', 'name', 'good_common', 'bad_common', 'total_common', 'total_books', 'ratio', 'url' ]);

for my $user_id (keys %good_users) {
    my $userHash = $good_users{$user_id};

    if (($user_id ne $USERID) && ($userHash->{votes} >= 2)) {
        my $user = $userHash->{user};
        my ($uBooks) = bookshelf($user_id);    # list context: first element is the 4-5 star list
        my $numBooks = scalar @$uBooks;

        if (!$MAXBOOKS || ($numBooks <= $MAXBOOKS)) {
            my $total = $userHash->{votes} + $userHash->{bad};
            $csv->print($fh, [ $user->{id}, $user->{name}, $userHash->{votes}, $userHash->{bad}, $total, $numBooks, $numBooks > 0 ? $total / $numBooks : 0, "https://www.goodreads.com/review/list/$user_id?sort=rating" ]);

            for my $gb (@$uBooks) {
                $good_books{$gb->{id}} = { 'votes' => (defined($good_books{$gb->{id}}->{votes}) ? $good_books{$gb->{id}}->{votes} : 0) + 1, 'book' => $gb };
            }
        } else {
            print "\nskipped books for $user_id: $numBooks > $MAXBOOKS\n";
        }
    }
}

close $fh or die "failed to close $filename: $!";

printf("\nHere are your best books (out of %d books):\n", scalar keys %good_books);
$OUTPATH = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-books.csv") if !$OUTPATH;

$csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open $fh, ">:encoding(utf8)", $OUTPATH or die "failed to create $OUTPATH: $!";

$csv->print($fh, [ 'bid', 'title', 'author', 'votes', 'avg_rating', 'num_ratings', 'genres', 'img_url' ]);

for my $bk (sort {$b->{votes} <=> $a->{votes}} values(%good_books)) {
    if ($bk->{votes} > 1) {
        my $book = $bk->{book};    # avoid "my $b", which shadows sort's $b
        my $genres = bookgenres($book->{id});
        printf("%s with %d votes\n", $book->{title}, $bk->{votes});
        $csv->print($fh, [ $book->{id}, $book->{title}, $book->{rh_author}->{name}, $bk->{votes}, $book->{avg_rating}, $book->{num_ratings}, join(', ', @$genres), $book->{img_url} ]);
    }
}

close $fh or die "failed to close $OUTPATH: $!";
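
The ranking idea in the script boils down to counting, for each book, how many like-minded users rated it 4 or 5 stars. Here is a minimal sketch of just that counting step, without any Goodreads access. The shelf data and the tally_votes helper are made up for illustration and are not part of Goodscrapes or the script above:

```perl
use strict;
use warnings;

# Made-up shelves: user id => { book title => star rating }.
my %shelves = (
    alice => { 'Dune' => 5, 'Emma' => 4, 'Ulysses' => 2 },
    bob   => { 'Dune' => 4, 'Emma' => 1 },
    carol => { 'Dune' => 5, 'Emma' => 5, 'Ulysses' => 3 },
);

# Hypothetical helper: one vote per user for every book rated 4 or 5 stars.
sub tally_votes {
    my ($shelves) = @_;
    my %votes;
    for my $user (keys %$shelves) {
        my $ratings = $shelves->{$user};
        for my $title (keys %$ratings) {
            $votes{$title}++ if $ratings->{$title} >= 4;
        }
    }
    return \%votes;
}

my $votes = tally_votes(\%shelves);

# Highest vote count first; ties broken by title for stable output.
for my $title (sort { $votes->{$b} <=> $votes->{$a} or $a cmp $b } keys %$votes) {
    printf "%s: %d\n", $title, $votes->{$title};
}
```

With this toy data it prints "Dune: 3" and then "Emma: 2"; "Ulysses" never gets a vote because nobody rated it 4 or higher.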

san-kumar commented on May 29, 2024

For this to work, Goodscrapes.pm needs a minor patch at line 2075:

$bk{ user_rating     } = $row =~            /data-rating="(\d+)"/                   ? ($1?$1:0) : 0;

I guess Goodreads has changed its HTML, so without this patch the user rating always comes out as 0. The line above fixes it.
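
To see what the patched match does in isolation, here is a small sketch. The extract_user_rating helper and the two HTML fragments are made up for illustration; the real Goodreads shelf markup may look different:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the star rating out of one shelf-table row.
# Returns 0 when no data-rating attribute is present (unrated book).
sub extract_user_rating {
    my ($row) = @_;
    return $row =~ /data-rating="(\d+)"/ ? $1 + 0 : 0;
}

# Made-up row fragments, only for illustration:
my $rated   = '<div class="stars" data-rating="4"></div>';
my $unrated = '<div class="stars"></div>';

printf "rated: %d, unrated: %d\n",
    extract_user_rating($rated), extract_user_rating($unrated);
```

This prints "rated: 4, unrated: 0", i.e. a row without the attribute falls back to 0 instead of failing.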

WaterSibilantFalling commented on May 29, 2024

Super like

mcleanle commented on May 29, 2024

This is exactly what I've been looking for! Can it be run in Docker?

san-kumar commented on May 29, 2024

> This is exactly what I've been looking for! Can it be run in Docker?

I haven't tried it, but it shouldn't be hard. Just modify the goodreads-toolbox Dockerfile to copy this script into the container; the rest should stay the same.

mcleanle commented on May 29, 2024

I added your script and the patch to the goodreads-toolbox directory, modified the .dockerignore file to include the new script in the exceptions list, and then rebuilt the container from my local drive instead of pointing to GitHub in the build command. However, it seems to have broken my bash prompt, and I get "no such file or directory" when trying to run any of the scripts in the container. Oh well! I'm not a Linux programmer and had never messed around with Docker before today. I realize this isn't a Docker help forum, but if you happen to have any tips I would love to hear them. Thank you for your awesome work on this! I hope the toolbox will be supported again one day and this can be added as an official script.

san-kumar commented on May 29, 2024

I think your Dockerfile may be missing the entrypoint. I haven't tried this in Docker yet and haven't looked at the Dockerfile (I'll maybe check on the weekend), but you need to copy the entry point from the original Dockerfile into the modified file. OTOH, if you don't want to mess with the Dockerfile, you can just mount a volume (with the -v option) and put this script there. Then use docker exec -it $pid bash to enter the container (where $pid stands for the container ID or name) and run perl script.pl. Sorry, I'm typing all this from memory, so you may have to do some digging around, but I reckon both approaches should work.

mcleanle commented on May 29, 2024

Thanks again for your help! For anyone who stumbles across this in future, here are all the steps I took to eventually get this working in Docker for Windows:

  1. Clone the repo
  2. Paste @san-kumar's script into a new blank text file called bookfinder.pl
  3. Replace line 205 with the following: use local::lib "$FindBin::Bin/lib/local/"; use lib "$FindBin::Bin/lib/";
  4. Patch /lib/Goodscrapes.pm as @san-kumar mentions above
  5. Add perl-text-csv \ at line 47 in Dockerfile
  6. Add !/bookfinder.pl anywhere in .dockerignore
  7. Open a command prompt and cd to the repo directory
  8. Enter docker build -t goodreads-toolbox . and wait for the build to complete
  9. Enter docker run -it --publish=8080:80 goodreads-toolbox
  10. At the bash prompt, run perl bookfinder.pl

That's it!
