Comments (9)
Hi San Kumar, thanks for sharing your script. I will definitely check this out over the course of the next week.
from goodreads-toolbox.
#!/usr/bin/env perl
#<--------------------------------- MAN PAGE --------------------------------->|
=pod
=head1 NAME
bookfinder - finding books by looking at bookshelves of people who read similar books
=head1 PURPOSE
=over
=item * fetches books with 4 and 5 stars in your profile
=item * crawls reviews of these books to find users who also rated it 4 or 5 stars
=item * looks up the bookshelves of those users to see which books they rated 4 or 5 stars
=item * ranks books based on number of votes from these users
=item * also ranks users by number of books they have in common (min 3)
=item * also gives special treatment (more votes) to users who not only love the same books as you but also hate the same books as you
=back
=head1 SYNOPSIS
B<bookfinder.pl>
[B<-n> F<number>]
[B<-a> F<number>]
[B<-x> F<number>]
[B<-d> F<filename>]
[B<-u> F<number>]
[B<-c> F<numdays>]
[B<-o> F<filename>]
[B<-s> F<shelfname> ...]
[B<-i>]
F<goodloginmail> [F<goodloginpass>]
=head1 OPTIONS
Mandatory arguments to long options are mandatory for short options too.
=over 4
=item B<-n, --maxbooks>=F<number>
Max number of books in a user's bookshelf. Currently set to
500. People who have hundreds or thousands of books often
add more noise than signal to your results.
=item B<-x, --rigor>=F<numlevel>
we need to find members who rate the books of our authors,
though Goodreads just shows a few ratings.
We exploit ratings filters and the reviews-search to find more members:
level 1 = filters-based search of book-raters (max 5400 ratings) - default
level 2 = like 1 plus dict-search if >3000 ratings with stall-time of 2min
level n = like 1 plus dict-search with stall-time of n minutes
Rigor level 0 is useless here (latest readers only),
and 2+ (dict-search) has a bad cost/benefit ratio given hundreds of books.
=item B<-d, --dict>=F<filename>
default is F<./list-in/dict.lst>
=item B<-u, --userid>=F<number>
check another member instead of the one identified by the login-mail
and password arguments. You find the ID by looking at the shelf URLs.
=item B<-c, --cache>=F<numdays>
number of days to store and reuse downloaded data in F</tmp/FileCache/>,
default is 31 days. This helps with cheap recovery on a crash, power blackout
or pause, and when experimenting with parameters. Loading data from Goodreads
is a very time consuming process.
=item B<-o, --outfile>=F<filename>
name of the CSV file to which the book results are written, default is
F<./list-out/bookfinder-books.csv>
=item B<-i, --ignore-errors>
Don't retry on errors, just keep going.
Sometimes useful if a single Goodreads resource hangs over long periods
and you're okay with some values missing in your result.
This option is not recommended when you run the program unattended.
=item B<-?, --help>
show full man page
=back
=head1 FILES
F<./list-in/dict.lst>
F<./list-out/bookfinder-users.csv>
F<./list-out/bookfinder-books.csv>
F</tmp/FileCache/>
=head1 EXAMPLES
$ ./bookfinder.pl [email protected] MyPASSword
$ ./bookfinder.pl -c 31 -o myfile.csv [email protected] pass
=head1 REPORTING BUGS
Report bugs to <[email protected]> or use Github's issue tracker
L<https://github.com/andre-st/goodreads-toolbox/issues>
=head1 COPYRIGHT
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License L<https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
=head1 VERSION
2020-01-23 (Since 2018-06-22)
=cut
#<--------------------------------- 79 chars --------------------------------->|
use strict;
use warnings qw(all);
use locale;
use 5.18.0;
# Perl core:
use FindBin;
use lib "$FindBin::Bin/lib/";
use Time::HiRes qw(time tv_interval);
use POSIX qw(strftime floor locale_h);
use File::Spec; # Platform indep. directory separator
use IO::File;
use Getopt::Long;
use Pod::Usage;
# Third party:
use Text::CSV;
# Ours:
use Goodscrapes;
# ----------------------------------------------------------------------------
# Program configuration:
#
setlocale(LC_CTYPE, "en_US"); # GR dates all en_US
STDOUT->autoflush(1);
gsetopt(cache_days => 31);
our $TSTART = time();
our $MINCOMMON = 5;
our $MAXAUBOOKS = 100;
our $RIGOR = 1;
our $MAXBOOKS = 500;
our $DICTPATH = File::Spec->catfile($FindBin::Bin, 'list-in', 'dict.lst');
our $OUTPATH;
our @SHELVES;
our $USERID;
GetOptions('rigor|x=i' => \$RIGOR,
'dict|d=s' => \$DICTPATH,
'userid|u=s' => \$USERID,
'outfile|o=s' => \$OUTPATH,
'maxbooks|n=i' => \$MAXBOOKS,
'shelf|s=s' => \@SHELVES,
'ignore-errors|i' => sub {gsetopt(ignore_errors => 1);},
'cache|c=i' => sub {gsetopt(cache_days => $_[1]);},
'help|?' => sub {pod2usage(-verbose => 2);})
or pod2usage(1);
pod2usage(1) if !$ARGV[0];
glogin(usermail => $ARGV[0], # Login also allows to load 200 books in 1 request
userpass => $ARGV[1], # Asks pw if omitted
r_userid => \$USERID);
sub bookshelf {
my $id = shift;
my %books;
print "\nLooking up bookshelf of $id..";
greadshelf(from_user_id => $id,
ra_from_shelves => [ 'read' ],
rh_into => \%books,
# on_book => sub{},
on_progress => gmeter('books')
);
my (@good, @bad);
for my $book_id (keys %books) {
my $book = $books{$book_id};
#next unless $book->{title} =~ /Club/;
my $rating = $book->{user_rating} // 0; # 0 = not rated
push(@good, $book) if ($rating >= 4);
push(@bad, $book) if ($rating >= 1 && $rating <= 2); # don't count unrated books as hated
#warn("cannot find rating for $book->{title} of $id\n") unless ($rating >= 1);
}
return (\@good, \@bad);
}
sub bookgenres {
my $bid = shift;
my $html = Goodscrapes::_html(Goodscrapes::_book_url($bid));
my @genres;
while ($html =~ m[href="/genres/([\w-]+)"]g) {
push(@genres, $1);
}
return \@genres;
}
my ($su_good, $su_bad) = bookshelf($USERID);
my (%good_users, %good_books, %haters);
for my $b (@$su_good) {
print "\nLooking up reviews for $b->{title}..";
$b->{reviews} = {};
greadreviews(rh_for_book => $b,
rh_into => $b->{reviews},
rigor => $RIGOR,
dict_path => $DICTPATH,
on_progress => gmeter('memb'));
for my $rev (values %{$b->{reviews}}) {
my $u = $rev->{rh_user};
if ($rev->{rating} >= 4) {
$good_users{$u->{id}} = { 'votes' => (defined($good_users{$u->{id}}->{votes}) ? $good_users{$u->{id}}->{votes} : 0) + 1, 'user' => $u };
} elsif ($rev->{rating} >= 1 && $rev->{rating} <= 2) { # skip reviews without a star rating
$haters{$u->{id}} = { 'votes' => (defined($haters{$u->{id}}->{votes}) ? $haters{$u->{id}}->{votes} : 0) + 1, 'user' => $u };
}
}
}
for my $u (keys %good_users) {
$good_users{$u}->{'bad'} = defined($haters{$u}->{votes}) ? $haters{$u}->{votes} : 0;
}
printf("\nHere are your best users (out of %d users):\n", scalar keys %good_users);
my $filename = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-users.csv");
my $csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open my $fh, ">:encoding(utf8)", $filename or die "failed to create $filename: $!";
$csv->print($fh, [ 'uid', 'name', 'good_common', 'bad_common', 'total_common', 'total_books', 'ratio', 'url' ]);
for my $user_id (keys %good_users) {
my $userHash = $good_users{$user_id};
if (($user_id ne $USERID) && ($userHash->{votes} >= 2)) {
my $user = $userHash->{user};
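my ($uBooks) = bookshelf($user_id); # list context: take the 4-5 star list (scalar context would yield the last element, the 1-2 star list)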
my $numBooks = scalar @$uBooks;
if (!$MAXBOOKS || ($numBooks <= $MAXBOOKS)) {
my $total = $userHash->{votes} + $userHash->{bad};
$csv->print($fh, [ $user->{id}, $user->{name}, $userHash->{votes}, $userHash->{bad}, $total, $numBooks, $numBooks > 0 ? $total / $numBooks : 0, "https://www.goodreads.com/review/list/$user_id?sort=rating" ]);
for my $gb (@$uBooks) {
$good_books{$gb->{id}} = { 'votes' => (defined($good_books{$gb->{id}}->{votes}) ? $good_books{$gb->{id}}->{votes} : 0) + 1, 'book' => $gb };
}
} else {
print "\nskipped books for $user_id: $numBooks > $MAXBOOKS\n";
}
}
}
close $fh or die "failed to close $filename: $!";
printf("\nHere are your best books (out of %d books):\n", scalar keys %good_books);
$OUTPATH = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-books.csv") if !$OUTPATH;
$csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open $fh, ">:encoding(utf8)", $OUTPATH or die "failed to create $OUTPATH: $!";
$csv->print($fh, [ 'bid', 'title', 'author', 'votes', 'avg_rating', 'num_ratings', 'genres', 'img_url' ]);
for my $bk (sort {$b->{votes} <=> $a->{votes}} values(%good_books)) {
if ($bk->{votes} > 1) {
my $b = $bk->{book};
my $genres = bookgenres($b->{id});
printf("%s with %d votes\n", $b->{title}, $bk->{votes});
$csv->print($fh, [ $b->{id}, $b->{title}, $b->{rh_author}->{name}, $bk->{votes}, $b->{avg_rating}, $b->{num_ratings}, join(', ', @$genres), $b->{img_url} ]);
}
}
close $fh or die "failed to close $OUTPATH: $!";
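The ranking step described in the PURPOSE section can be tried on toy data without touching Goodreads. This is only a minimal sketch: the reader names and book titles are made up, and `%liked_by` is a hypothetical stand-in for what the script collects via `bookshelf()`. It counts one vote per reader per book and keeps books with more than one vote, as the final loop of the script does.

```perl
use strict;
use warnings;

# Toy data (made up): reader => books that reader rated 4 or 5 stars.
my %liked_by = (
    alice => [ 'Dune', 'Emma', 'Ulysses' ],
    bob   => [ 'Dune', 'Emma' ],
    carol => [ 'Dune' ],
);

# One vote per reader per book.
my %votes;
for my $reader (keys %liked_by) {
    $votes{$_}++ for @{ $liked_by{$reader} };
}

# Keep books with more than one vote; most votes first, ties broken by title.
my @ranked = sort { $votes{$b} <=> $votes{$a} or $a cmp $b }
             grep { $votes{$_} > 1 } keys %votes;

printf("%s with %d votes\n", $_, $votes{$_}) for @ranked;
```

With this data, "Dune" gets three votes, "Emma" two, and "Ulysses" is dropped because only one reader liked it.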
from goodreads-toolbox.
For this to work, there is a minor patch in Goodscrapes.pm, line 2075:
$bk{ user_rating } = $row =~ /data-rating="(\d+)"/ ? ($1?$1:0) : 0;
I guess Goodreads has changed the HTML, so the user rating was always 0. The above line fixes it.
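To see what the patched expression does in isolation, here is a small sketch. The HTML rows are invented for illustration (real Goodreads markup may differ), and `user_rating` is a hypothetical helper name that is not part of Goodscrapes; only the regex and the fallback to 0 come from the patch above.

```perl
use strict;
use warnings;

# Hypothetical table-row snippets; real Goodreads markup may differ.
my @rows = (
    '<tr><td><div class="stars" data-rating="4"></div></td></tr>',
    '<tr><td><div class="stars" data-rating="0"></div></td></tr>',
    '<tr><td><div class="stars"></div></td></tr>',
);

# Same expression as the patched line: take the digits from data-rating,
# fall back to 0 when the attribute is missing or the value is 0.
sub user_rating {
    my ($row) = @_;
    return $row =~ /data-rating="(\d+)"/ ? ($1 ? $1 : 0) : 0;
}

my @ratings = map { user_rating($_) } @rows;
print join(', ', @ratings), "\n";   # prints "4, 0, 0"
```

A rating of 0 then means "not rated", which is why the shelf filter should not treat it as a low rating.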
from goodreads-toolbox.
Super like
from goodreads-toolbox.
This is exactly what I've been looking for! Can it be run in Docker?
from goodreads-toolbox.
> This is exactly what I've been looking for! Can it be run in Docker?
I haven't tried it, but it shouldn't be hard. Just modify the goodreads-toolbox Dockerfile to copy this script into the container; the rest should stay the same.
from goodreads-toolbox.
I added your script and the patch to the goodreads-toolbox directory and then modified the .dockerignore file to include the new script in the exceptions list, then rebuilt the container from my local drive instead of pointing to github in the build command. However, it seems to have broken my bash prompt and I get "no such file or directory" when trying to run any of the scripts in the container. Oh well! I'm not a Linux programmer and have never messed around with Docker before until today. I realize this isn't a Docker help forum, however if you happen to have any tips I would love to hear them. Thank you for your awesome work on this! I hope the toolbox will be supported again one day and this can be added as an official script.
from goodreads-toolbox.
I think your Dockerfile may be missing the entrypoint. I haven't tried this in Docker yet and haven't looked at the Dockerfile (I may check on the weekend), but you need to copy the entrypoint from the original Dockerfile into the modified file. OTOH, if you don't want to mess with the Dockerfile, you can just mount a volume (with the -v option) and put this script there. Then use `docker exec -it $pid bash` to enter the container and run `perl script.pl`. Sorry, I'm typing all this from memory, so you may have to do some digging around, but I reckon both approaches should work.
from goodreads-toolbox.
Thanks again for your help! For anyone who stumbles across this in future, here are all the steps I took to eventually get this working in Docker for Windows:
- Clone the repo
- Paste @san-kumar's script into a new blank text file called `bookfinder.pl`
- Replace line 205 with the following: `use local::lib "$FindBin::Bin/lib/local/"; use lib "$FindBin::Bin/lib/";`
- Patch `/lib/Goodscrapes.pm` as @san-kumar mentions above
- Add `perl-text-csv \` at line 47 in `Dockerfile`
- Add `!/bookfinder.pl` anywhere in `.dockerignore`
- Open a command prompt and `cd` to the repo directory
- Enter `docker build -t goodreads-toolbox .` and wait for the build to complete
- Enter `docker run -it --publish=8080:80 goodreads-toolbox`
- At the bash prompt, run `perl bookfinder.pl`
That's it!
from goodreads-toolbox.