httrack's People

Contributors

brianredbeard, cicku, e-ht, fornwall, fweimer-rh, glogiotatidis, gpunktschmitz, jackdanger, jayaddison, mice7r, pablocastellano, scootergrisen, sickcodes, smokris, soundasleep, xroche, zmodem

httrack's Issues

Redirect to the same location with cookie is not supported

What steps will reproduce the problem?
1. Download a site redirecting to the same page, or to an already crawled page, 
with different cookie settings

What is the expected output? What do you see instead?
The page should be downloaded again with the new cookie settings, but it is not 
because of the page dedup, which is based on URI/GET parameters only.
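
A minimal sketch of the idea, not HTTrack's actual dedup code: if the key used to decide whether a page was already crawled also includes the cookie state, the same URL is fetched again when the cookies change. The names `dedup_key`, `should_fetch` and `seen` are hypothetical.

```python
from urllib.parse import urlsplit

def dedup_key(url, cookies=None):
    """Key for the 'already crawled' set (hypothetical helper)."""
    parts = urlsplit(url)
    # Sort cookies so the key does not depend on insertion order.
    cookie_part = tuple(sorted((cookies or {}).items()))
    return (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, cookie_part)

seen = set()

def should_fetch(url, cookies=None):
    """Return True the first time a (URL, cookie state) pair is seen."""
    key = dedup_key(url, cookies)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With a key based on URI/GET parameters only, the third call in a sequence "no cookie, no cookie, new cookie" would be skipped; with the key above it is fetched.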

Please use labels and text to provide additional information.
See http://forum.httrack.com/readmsg/31101


Original issue reported on code.google.com by xroche on 6 Jun 2013 at 8:18

onMouseOver="src='image'" is not recognized

What steps will reproduce the problem?
1. Image with onXXX properties such as 
<img onMouseOver="src='i/1_home.gif'" onMouseOut="src='i/2_home.gif'"
alt='Home' src='i/a_home.gif'>
2. Crawl the page

What is the expected output? What do you see instead?
The two OnXXX related images are not captured
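
A rough sketch of how such references could be detected, assuming a simple regex-based scan (this is not how HTTrack's parser works, and the function name is hypothetical): look inside every onXXX attribute value for a `src='...'` assignment.

```python
import re

# Matches onXXX="..." or onXXX='...' attribute values (case-insensitive).
ON_ATTR = re.compile(r"""\bon\w+\s*=\s*(?:"([^"]*)"|'([^']*)')""", re.IGNORECASE)
# Matches src='value' or src="value" inside the handler script.
SRC_ASSIGN = re.compile(r"""\bsrc\s*=\s*(?:'([^']*)'|"([^"]*)")""", re.IGNORECASE)

def handler_image_urls(html):
    """Collect URLs assigned to src inside onXXX event handlers."""
    urls = []
    for attr in ON_ATTR.finditer(html):
        script = attr.group(1) if attr.group(1) is not None else attr.group(2)
        for assign in SRC_ASSIGN.finditer(script):
            url = assign.group(1) if assign.group(1) is not None else assign.group(2)
            urls.append(url)
    return urls
```

Applied to the tag from step 1, this returns the two handler images and leaves the regular `src` attribute to the normal link parser.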


Original issue reported on code.google.com by xroche on 28 Feb 2013 at 8:25

Add torrent MIME type to the built-in list

What steps will reproduce the problem?
1. Try to mirror a website that dynamically serves .torrent files

What is the expected behavior ? What do you get instead?
HTTrack should treat the content as files, but it doesn't because it doesn't 
recognize this particular MIME type.
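
For illustration only, a sketch of a MIME-type lookup with the torrent type added. `application/x-bittorrent` is the type commonly used for .torrent files; the table, the fallback to "html" and the function name are assumptions and do not reproduce HTTrack's built-in list or the attached patch.

```python
# Hypothetical MIME-type table; not HTTrack's real built-in list.
MIME_TO_EXT = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/x-bittorrent": "torrent",
}

def extension_for(content_type, default="html"):
    """Map a Content-Type header value to a local file extension."""
    # Strip parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return MIME_TO_EXT.get(mime, default)
```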

What version of httrack are you using? On what operating system?
HTTrack version 3.47-21+libhtsjava.so.2 on Debian Wheezy

Trivial patch attached.

Original issue reported on code.google.com by [email protected] on 14 Jul 2013 at 6:44

Attachments:

Referer not set to the page from which the link is followed

What steps will reproduce the problem?
1. Mirror a website with depth -r3 and with --debug-headers

What is the expected behavior ? What do you get instead?
All Referer headers in hts-ioinfo.txt have the same referer which is the 
starting URL, even though not all pages are reachable in one hop from the 
starting URL. Expected is for Referer to be set to the URL of the page from 
which the link was followed.
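
A sketch of the expected behavior, assuming a plain breadth-first crawl (the function and the `links` mapping are hypothetical and unrelated to the attached patch): each queued URL carries the page it was found on, and that page becomes its Referer.

```python
from collections import deque

def crawl_order(start_url, links, max_depth):
    """Return (url, referer) pairs; 'links' maps a page URL to the URLs it links to."""
    queue = deque([(start_url, None, 0)])
    seen = {start_url}
    result = []
    while queue:
        url, referer, depth = queue.popleft()
        result.append((url, referer))
        if depth >= max_depth:
            continue
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                # The referer is the page the link was found on, not the start URL.
                queue.append((child, url, depth + 1))
    return result
```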

What version of httrack are you using? On what operating system?
HTTrack version 3.47-21+libhtsjava.so.2 on Debian Wheezy.

For a moment there I thought the server in question was checking Referer 
headers, so I made a patch for this, but it turned out that the Referer headers 
did not matter in that case. Here's the one-line patch anyway.

Original issue reported on code.google.com by [email protected] on 14 Jul 2013 at 6:40

Attachments:

Notepad shortcut was changed

What steps will reproduce the problem?
1. Open any saved .txt file.
2. Middle-click the Notepad button on the taskbar.
3. The HTTrack readme appears instead of a blank Notepad window.

What is the expected behavior ? What do you get instead?
I expect a blank Notepad window. I get the HTTrack readme instead. I don't 
know how to undo this once it has happened. Luckily, uninstalling fixes it.

What version of httrack are you using? On what operating system?
3.47.27 from the Product version of the installer exe file.
I'm using Windows 8.1 64-bit Professional.

Please provide any additional information below.
Please don't change this shortcut. That's why I don't like to allow Privilege 
Elevation to any app.

Original issue reported on code.google.com by [email protected] on 17 Mar 2014 at 3:55

Bogus charset for requests when filenames have non-ascii characters (RFC 3986)

What steps will reproduce the problem?
1. Download a page containing links with filenames embedding non-ascii 
characters

What is the expected output? What do you see instead?
Bad request sent to the server because of buggy encoding. According to RFC 
3986, UTF-8 should be used with URL-encoding.
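
As an illustration of the encoding described above, Python's standard `urllib.parse.quote` percent-encodes non-ASCII characters as UTF-8 by default. The path used here is a made-up example.

```python
from urllib.parse import quote

def encode_path(path):
    """Percent-encode a URL path, using UTF-8 for non-ASCII characters."""
    # "/" is kept as-is so the path structure is preserved.
    return quote(path, safe="/")
```

For a hypothetical path such as "/files/été.html", each "é" becomes the two-byte sequence %C3%A9.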

What version of the product are you using? On what operating system?
3.47.12

Please provide any additional information below.
Reported by Steven Hsiao (http://forum.httrack.com/readmsg/31050/index.html)

Original issue reported on code.google.com by xroche on 18 May 2013 at 4:48

Homebrew SHA1 Mismatch

What steps will reproduce the problem?
'brew install httrack'

What is the expected behavior ? What do you get instead?
Expected download and installation of HTTrack software, following error was 
received instead:

┌─[peter@foo] - [~] - [Thu Aug 01, 12:44]
└─[$] <> brew install httrack
==> Downloading http://download.httrack.com/httrack-3.46.1.tar.gz
######################################################################## 100.0%
Error: SHA1 mismatch
Expected: be6328d2ff3cbabd21426b7acc54edcf1ebb76e0
Actual: 2ba3da7784bcd67ff98ff09c419cfb700c97ba5b
Archive: /Library/Caches/Homebrew/httrack-3.46.1.tar.gz
(To retry an incomplete download, remove the file above.)
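
To check by hand which of the two digests matches the cached archive, a small sketch using Python's standard `hashlib` (the helper name is hypothetical):

```python
import hashlib

def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling it on the path shown after "Archive:" and comparing the result with the "Expected" and "Actual" values tells whether the download is corrupted or the formula's checksum is out of date.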

What version of httrack are you using? On what operating system?
httrack-3.46.1 on OS X (v10.8.4), 

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 1 Aug 2013 at 7:49

Window to Front (r1290) not Working Correctly

What steps will reproduce the problem?
1. This rev introduced a change to try and prevent the WinHTTrack window from 
stealing focus: https://code.google.com/p/httrack/source/detail?r=1290
2. There are two use cases to test this change. First, WinHTTrack is minimised 
to the taskbar when it completes a mirror. Second, WinHTTrack is not minimised 
to the taskbar - but is not the active window - when it completes a mirror.
3. The fix only works for use case one; the problem still exists with use 
case two.

What is the expected behavior ? What do you get instead?
The WinHTTrack window will not force itself to the top when a mirror completes. 
Instead, it still does so.

What version of httrack are you using? On what operating system?
v3.48.19, Win 7 64bit.

Please provide any additional information below.
Forum thread is here: http://forum.httrack.com/readmsg/33162/index.html
As stated, one use case has been fixed - so progress has been made, but the 
other use case is still a problem.

Original issue reported on code.google.com by [email protected] on 5 Aug 2014 at 11:58

WinHttrack crashes

What steps will reproduce the problem?
1. Download some web addresses
2. WinHTTrack always crashes after some time

What is the expected behavior ? What do you get instead?
I'm expecting NO crash

What version of httrack are you using? On what operating system?
WinHttrack 3.48.3, german
Windows XP

Please provide any additional information below.
Message in crashes.txt:
HTTrack 3.48.3 closed at '..\httrack\htsinthash.c', line 788
Reason:
assert failed: ! "hashtable internal error: cuckoo/stash collision"

Original issue reported on code.google.com by [email protected] on 4 May 2014 at 1:37

httrack can't parse some very advanced js

You probably already know that httrack is unable to emulate browser behaviour 
for some kinds of JavaScript, but I'm reporting this in case you want to look 
into it:

URL: http://hemerotecadigital.bn.br/acervo-digital/norte-goyaz/120685

What is expected? Download the HTML page and all related .PDF files

What occurs? Download only the HTML page, removing the <base href/> tag with no 
further retrievals

I'm running WinHTTrack Website Copier 3.47-27 on Win7 SP1. I can't share full 
settings for this project, but it is set to "Get non-HTML files related to a 
link" and with a bunch of +http://memoria.bn.br/* filters

Original issue reported on code.google.com by [email protected] on 7 Apr 2014 at 10:41

Error when decompressing

What steps will reproduce the problem?
1. Try download http://gz.ifeng.com/zaobanche/detail_2014_06/14/2429049_0.shtml

What is the expected behavior ? What do you get instead?
Should download correctly. Stopped and error reported.

What version of httrack are you using? On what operating system?
HTTrack3.48-13+htsswf+htsjava

Please provide any additional information below.

HTTrack3.48-13+htsswf+htsjava launched on Tue, 17 Jun 2014 21:55:17 at 
http://gz.ifeng.com/zaobanche/detail_2014_06/14/2429049_0.shtml -* +*.png 
+*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar 
+http://gz.ifeng.com/zaobanche/detail_2014_06/14/2429049_*.shtml
(winhttrack -qwC2%Ps0u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 
(compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack 
Website Copier/3.x [XR&CO'2013], %s -->" -%l "en, *" 
http://gz.ifeng.com/zaobanche/detail_2014_06/14/2429049_0.shtml -O1 "C:\My Web 
Sites\娘子军舞蹈用腿开枪网络爆红" -* +*.png +*.gif +*.jpg +*.css 
+*.js -ad.doubleclick.net/* -mime:application/foobar 
+http://gz.ifeng.com/zaobanche/detail_2014_06/14/2429049_*.shtml )
Information, Warnings and Errors reported for this mirror:

21:55:19 Error:  "Error when decompressing" (-1) at link 
gz.ifeng.com/zaobanche/detail_2014_06/14/2429049_0.shtml (from primary/primary)

21:55:19 Warning:  No data seems to have been transferred during this session! 
: restoring previous one!

similar to issue38 https://code.google.com/p/httrack/issues/detail?id=38
Not limited to this website.  Many others also met with the same error.

Original issue reported on code.google.com by [email protected] on 17 Jun 2014 at 7:57

Non-HTML files wrongly renamed with .html extension

Enclosed is a sample project, "test-2", in which some non-HTML files are 
renamed locally with a ".html" extension.
This prevents many browsers from displaying them correctly.
Many files with the ".html" extension are actually PDFs, for example in folder 
"test-2\eco.uninsubria.it\webdocenti\amira\inferenza\":
"12set01_inf.html", "26-mar-02A.html", "2lug01_inf.html", ...
Note: I have replaced all the real PDF files with zero-byte files, 
except "12set01_inf.html".
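
One way to catch such cases, sketched under the assumption that the first bytes of the downloaded file are available (this is not HTTrack's type detection, and the function name is hypothetical), is to check the file signature: PDF files start with "%PDF-" and PostScript files with "%!PS".

```python
def sniff_extension(first_bytes, declared_ext="html"):
    """Pick an extension from the file signature, else keep the declared one."""
    # PDF files start with the ASCII signature "%PDF-".
    if first_bytes.startswith(b"%PDF-"):
        return "pdf"
    # PostScript files start with "%!PS".
    if first_bytes.startswith(b"%!PS"):
        return "ps"
    return declared_ext
```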

A second, less important problem:
In the same project, some folders are empty or contain only other folders 
and no files.
Some HTML files (real HTML this time) have links to local files that do not 
exist because they were not downloaded.
When a file does not exist locally, even because of a download error, the 
corresponding links in the HTML file should be absolute (http://...).
For example, in project "test-2":
Folder "test-2\aim.unipv.it\" has no data, only subfolders.
File "test-2\eco.uninsubria.it\webdocenti\amira\inferenza\prog.html" has the link 
href="../../../../aim.unipv.it/_anto/prog-andati.ps".
"prog.html" should instead have the link 
href="http://aim.unipv.it/~anto/prog-andati.ps", because the file 
"prog-andati.ps" does not exist locally.

Version of httrack: 3.47-27
My operating system is Windows 7.

Original issue reported on code.google.com by [email protected] on 16 Oct 2013 at 2:46

Attachments:

AArch64 Support

Hi,

Does httrack support aarch64 now?

Thanks.

Original issue reported on code.google.com by Cickumqt on 13 Sep 2013 at 5:53

Limit the final destination path length to 256 (Windows compatibility)

I'm trying to mirror websites powered by WordPress. But because some pages have very long URLs (such as 
www.ambiente.sp.gov.br/cea/guia-bibliografico/bases-para-conservacao-e-uso-sustentavel-do-cerrado-paulistasecretaria-de-estado-do-meio-ambiente-smaprograma-estadual-para-a-conservacao-da-biodiversidade-probio/ ) 
and Windows limits the number of characters in folder plus file names, it generates lots of "serialize error"s.
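
A possible approach, sketched here as an assumption and not as an existing HTTrack option: when the destination path exceeds the limit, truncate it and append a short hash of the full path so that names stay unique. The limit of 256 is taken from the issue title; the function name is hypothetical.

```python
import hashlib

def shorten_path(path, limit=256):
    """Return path unchanged if it fits, else truncate and append a short hash."""
    if len(path) <= limit:
        return path
    root, dot, ext = path.rpartition(".")
    # Treat the tail as an extension only if it looks like one.
    if not dot or "/" in ext or "\\" in ext or len(ext) > 10:
        root, ext = path, ""
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()[:12]
    suffix = "~" + digest + ("." + ext if ext else "")
    return root[: limit - len(suffix)] + suffix
```

The hash is derived from the full original path, so the same URL always maps to the same shortened name across runs.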

The build options available at http://www.httrack.com/html/fcguide.html don't 
satisfy my needs, so I'm asking for a new feature:

Place HTML pages in site_name/web/ under random names, and all other files under the 
default full URL name.

Original issue reported on code.google.com by [email protected] on 4 May 2013 at 1:31

I can't download a PHP page

What steps will reproduce the problem?
1.help us
2.
3.

What is the expected behavior ? What do you get instead?


What version of httrack are you using? On what operating system?


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 25 Sep 2013 at 12:24

Add support for sites on multiple domains

Some websites are available on more than one domain without one being a mirror of the other: 
it is a single website reachable under more than one domain (example: somewebsite.com and 
websitewebsite.com display exactly the same content, hosted on exactly the same 
server).

Please add a feature that lets users indicate such cases to httrack, 
so that it acts accordingly and does not fetch the same page/file from both URLs.
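
A sketch of what such a feature could do, assuming the user supplies an alias table (the `aliases` mapping and the function name are hypothetical, not an existing HTTrack option): every URL is rewritten to one canonical host before the duplicate check.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url, aliases):
    """Rewrite the host of a URL to its canonical name, if an alias is declared."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    return urlunsplit(parts._replace(netloc=aliases.get(host, host)))
```

With `{"websitewebsite.com": "somewebsite.com"}` as the table, a page reached through either domain produces the same canonical URL and is fetched once.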

Original issue reported on code.google.com by [email protected] on 24 Jun 2013 at 1:46

-%N0 causing "file not stored in cache due to bogus state" errors

What steps will reproduce the problem?
1. httrack https://twitter.com/TSBible -%N0

What is the expected behavior ? What do you get instead?

20:43:46 Warning:  file not stored in cache due to bogus state (broken size, 
expected 217392 got 888): https://twitter.com/TSBible?lang=gl
20:43:46 Warning:  file not stored in cache due to bogus state (broken size, 
expected 217525 got 908): https://twitter.com/TSBible?lang=it


What version of httrack are you using? On what operating system?
3.48.3

Please provide any additional information below.
Reported by http://forum.httrack.com/readmsg/32672/index.html


Original issue reported on code.google.com by xroche on 14 Apr 2014 at 6:45

get links error from chinese webpages

What steps will reproduce the problem?
1. Mirror https://tw.money.yahoo.com/international-news
2. Original URL, e.g. "/美股指數期貨最新報價-13-37-060749261.html"
3. Link obtained by httrack, e.g. 
"https://tw.money.yahoo.com/ގ股指數期貨最新報價-15-28-075053946.html"

What is the expected behavior ? What do you get instead?
expected behavior: 
https://tw.money.yahoo.com/美股指數期貨最新報價-15-28-075053946.html 
=>correct url for download
error: get 404 error for wrong url

What version of httrack are you using? On what operating system?
3.48.13




Original issue reported on code.google.com by [email protected] on 8 Jul 2014 at 8:29

Encoding inconsistencies

1- Info
Choose any link from
http://sistemasinter.cetesb.sp.gov.br/produtos/produto_consulta_completa.asp
that contains special characters (é, ó, ô, ê...), for example 
"ACETALDEÍDO" (the first one starting with A).
In some browsers you will get the expected page, but in others you will get one 
with a 500 error.

I got the correct page with Chrome 27.0.1453.94 m on Win7 Home Premium SP1 64-bit, but 
not with WinHTTrack 3.47-14 on the same OS.

Firefox 21.0 on a WinXP machine produced the same error as httrack, but 
exactly the same version on the current machine (specified above) 
works well.
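
One possible explanation, which is an assumption and not confirmed in this report, is that the clients percent-encode the accented character with different charsets. The sketch below only shows how the two encodings differ for the example name:

```python
from urllib.parse import quote

name = "ACETALDEÍDO"
# UTF-8 bytes, percent-encoded (the default for quote()).
as_utf8 = quote(name)
# ISO-8859-1 bytes, percent-encoded.
as_latin1 = quote(name, encoding="latin-1")
```

"Í" becomes %C3%8D in UTF-8 and %CD in ISO-8859-1, so a server that expects one form may fail to find the record when it receives the other.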

2- Full error message from page
ADODB.Field error '80020009'

Either BOF or EOF is True, or the current record has been deleted. Requested 
operation requires a current record.

/produtos/ficha_completa1.asp, line 0

Original issue reported on code.google.com by [email protected] on 29 May 2013 at 10:16

Attachments:

Too many "bogus state (incomplete type)" errors

On every run in a given project I'm getting too many "bogus state (incomplete 
type)" errors. And on every run this issue happens on the exactly same files.

It is clear to me that this is a httrack fault, not exactly a server one.

My suggestion is to add a feature to try to resume download for binary files 
with "bogus state (incomplete type)" errors.

Running WinHTTrack Website Copier 3.46 x64 on Win7 SP1.

Original issue reported on code.google.com by [email protected] on 30 Mar 2013 at 4:07

Attachments:

Failure on Windows 2000

What steps will reproduce the problem?
1. Install httrack-3.48.9.exe on Windows 2000
2. Start WinHTTrack.exe

What is the expected behavior ? What do you get instead?
Instead of the program starting, an error message dialog appears:
"The procedure entry point SetDllDirectoryA could not be located in the dynamic 
link library KERNEL32.dll"

What version of httrack are you using? On what operating system?
httrack-3.48.9.exe on Windows 2000

Please provide any additional information below.
Unsatisfied reference to SetDllDirectoryA is found in libhttrack.dll

Original issue reported on code.google.com by [email protected] on 4 Jun 2014 at 3:50

OS X: let httrack check for presence of libssl.dylib for SSL support

What steps will reproduce the problem?
1. install httrack via MacPorts
2. try to crawl https:// sites

Please provide any additional information below.
A link from /usr/lib/libssl.so to /usr/lib/libssl.dylib is a possible workaround


Original issue reported on code.google.com by xroche on 5 Apr 2013 at 4:05

Problem: HTTrack renaming a file that doesn't have extension on server to filenamehtml.html

What steps will reproduce the problem?
1. Point HTTrack to download.crystalbuntu.com

What is the expected output? What do you see instead?
Expecting gptsync file to still be called gptsync, however it gets renamed to 
gptsynchtml.html 

What version of the product are you using? On what operating system?
3.46

Please provide any additional information below.

attempting to use HTTrack to mirror this website, download.crystalbuntu.com and 
it works great except for that one issue.

note that I'm pointing IIS directly at the downloaded files area.

Original issue reported on code.google.com by [email protected] on 25 Feb 2013 at 5:04

Attachments:

httrack 3.48-14 can't use URL list file

What steps will reproduce the problem?
1. Run HTTrack with a URL list file to mirror a site (HTTrack.exe -%L startURL.4016)
2. hts-log.txt shows "Error: Could not include URL list: startURL.4016"


What is the expected behavior ? What do you get instead?
Expected behavior: HTTrack adds the links from the URL list file
Instead: Error: Could not include URL list: startURL.4016

What version of httrack are you using? On what operating system?
httrack 3.48-14 for windows xp


Original issue reported on code.google.com by [email protected] on 9 Jul 2014 at 6:06

Wildcard domains in cookies do not match

What steps will reproduce the problem?
1. Log in to a site with authentication that uses a cookie where the domain is of 
the form:
.domain.org  TRUE    /path FALSE   0000000000      key value
Note the leading dot, which, afaiu, means it should match domain.org and all 
its subdomains.
2. Export the cookies.txt into the httrack mirror directory
3. Try to mirror the website with --debug-headers to see the cookies

What is the expected behavior ? What do you get instead?
HTTrack is expected to find the cookie in cookies.txt when requesting URLs 
like http://domain.org/path and append that cookie to the request. The header 
log shows that the cookie is not appended.

What version of httrack are you using? On what operating system?
CLI version on Debian Wheezy: HTTrack version 3.47-21+libhtsjava.so.2

I think this is because this condition does not hold in this case: 
htsbauth.c:cookie_find: (int) strlen(chk_dom) <= (int) strlen(domain)
A tentative patch against the Debian source package is attached; it was tested with my 
particular website. I didn't read the RFC, so perhaps it's not what we want, but 
it fixed my issue.
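
For reference, a sketch of the matching rule described in step 1, in the spirit of RFC 6265 domain matching. It is written from scratch for illustration and does not reproduce the attached patch or htsbauth.c; the function name is hypothetical.

```python
def cookie_domain_matches(cookie_domain, host):
    """True if a cookie stored for cookie_domain applies to host."""
    # A leading dot is ignored: ".domain.org" covers domain.org and its subdomains.
    cookie_domain = cookie_domain.lower().lstrip(".")
    host = host.lower()
    # Exact match, or host is a subdomain ("www.domain.org" ends with ".domain.org").
    return host == cookie_domain or host.endswith("." + cookie_domain)
```

Note that ".domain.org" is one character longer than "domain.org", which is the case where the length comparison quoted above does not hold.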


Original issue reported on code.google.com by [email protected] on 14 Jul 2013 at 6:18

Attachments:

Windows naming sanity check

Check name restrictions wrt. "con", "aux", and friends.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
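
A minimal sketch of such a check, assuming the mirror simply prefixes an underscore to offending names (that renaming rule and the function name are hypothetical). The reserved device names are CON, PRN, AUX, NUL, COM1 to COM9 and LPT1 to LPT9, and they stay reserved even with an extension.

```python
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)

def safe_windows_name(name):
    """Prefix '_' to file names that Windows reserves for devices."""
    # Windows reserves these names regardless of extension ("con.txt" is still reserved).
    stem = name.split(".", 1)[0].rstrip(" ")
    if stem.upper() in RESERVED:
        return "_" + name
    return name
```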

Original issue reported on code.google.com by xroche on 7 Oct 2013 at 6:38

Filenames with "#" aren't getting downloaded

And this isn't a server issue.

If I go directly to the URL in browser I get the same error, but if I click 
from a page on the same server it downloads successfully. Maybe an issue with 
URL refer on httrack side or encoding issue?
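
One thing worth checking, sketched below with a made-up host and a shortened path: a literal "#" in a URL starts the fragment, so a file name containing "#" has to be sent as %23. Whether this is what happens inside httrack is an assumption.

```python
from urllib.parse import quote, urlsplit

def encode_literal_path(path):
    """Percent-encode a path whose '#' is part of the file name."""
    return quote(path, safe="/~")

# Hypothetical URL: everything after "#" is parsed as a fragment, not as path.
parsed = urlsplit("http://example.com/files/LT-3#L~8.DWG")
```

Here `parsed.path` is "/files/LT-3" and the rest ends up in `parsed.fragment`, while `encode_literal_path("/files/LT-3#L~8.DWG")` keeps the whole name by turning "#" into %23.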

20:49:01    381/381    ---M--    404 error ('Not%20Found')    text/html    date:Mon,%2012%20Aug%202013%2023:54:36%20GMT    http://licenciamento.ibama.gov.br/Hidreletricas/Belo%20Monte/Outros%20Documentos/Acompanhamento%20da%20LI_condicionantes/CE%20NE%20469_2011-DS_condicionante%202.4/03-Projeto%20Basico%20Linhas%20de%20Transmissao/LT%2034.5%20KV%20SE%20-%20BM%20-%20PI/LT-3#L~8.DWG    N:/[Ambiente]/0-IBAMA/Hidr/web/delayed/lt-3#l_8.d.delayed    (from http://licenciamento.ibama.gov.br/Hidreletricas/Belo%20Monte/Outros%20Documentos/Acompanhamento%20da%20LI_condicionantes/CE%20NE%20469_2011-DS_condicionante%202.4/03-Projeto%20Basico%20Linhas%20de%20Transmissao/LT%2034.5%20KV%20SE%20-%20BM%20-%20PI/)

Original issue reported on code.google.com by [email protected] on 16 Aug 2013 at 7:20

[Documentation] Version history

What steps will reproduce the problem?
1. Open http://www.httrack.com/history.txt
2. Last update is for 3.46-1
3. 3.47.2 is missing

What is the expected output? What do you see instead?
Expected to see the full change log, but history.txt doesn't seem to have been 
updated.

Please update accordingly.

Original issue reported on code.google.com by [email protected] on 24 Apr 2013 at 12:56

Hide "Could not create temporary reference file"

ISSUE

When I use httrack to fetch a Python script, it warns that it couldn't create a 
temporary reference file.

Whatever a reference file is, the download works without one.

Can you stop the warning message from being displayed?

REPRO

In an empty directory, do

$ httrack -g http://sebsauvage.net/python/html2csv.py

In the output you will see a message like 

    Warning:    Could not create temporary reference file for
    sebsauvage.net/python/html2csv.py

When the download is complete, check that the file is really there:

$ ls -l
-rw-r--r-- 1 sandport sandport 6021 Apr  4  2006 html2csv.py

So the file downloaded just fine.

I'm using HTTrack3.47-21+libhtsjava.so.2 on Xubuntu 13.04 64-bit.

I originally reported this on the forum:
http://forum.httrack.com/readmsg/32477/index.html

Google Code looks like a more appropriate place for bug reports.

EXAMPLE LOG

"""
$ httrack -g <http://sebsauvage.net/python/html2csv.py>
HTTrack3.47-21+libhtsjava.so.2 launched on Mon, 10 Feb 2014 20:47:02 at
<http://sebsauvage.net/python/html2csv.py>
(httrack -g <http://sebsauvage.net/python/html2csv.py> )

Information, Warnings and Errors reported for this mirror:
note:   the hts-log.txt file, and hts-cache folder, may contain sensitive
information,
    such as username/password authentication for websites mirrored in this
project
    do not share these files/folders if you want these information to remain
private

Mirror launched on Mon, 10 Feb 2014 20:47:02 by HTTrack Website
Copier/3.47-21+libhtsjava.so.2 [XR&CO'2013]
mirroring <http://sebsauvage.net/python/html2csv.py> with the wizard help..
20:47:32    Warning:    Could not create temporary reference file for
sebsauvage.net/python/html2csv.py
1/2: sebsauvage.net/python/html2csv.py (6021 bytes) - OK
HTTrack Website Copier/3.47-21 mirror complete in 31 seconds : 1 links
scanned, 1 files written (6021 bytes overall) [6328 bytes received at 204
bytes/sec]
(No errors, 1 warnings, 0 messages)
Done.
Thanks for using HTTrack!
$ ls
html2csv.py
"""

Original issue reported on code.google.com by [email protected] on 12 Feb 2014 at 6:05

Bogus charset because the meta http-equiv tag is placed too far in the html page

What steps will reproduce the problem?
1. Download 
file:///C:/temp/websites/test%20accent/www.bbc.co.uk/dna/mbarchers/NF26939435a56.html
2. View the page

What is the expected output? What do you see instead?
Accents are buggy and the correct charset is superseded by the original (buggy) 
one in the html source code
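
To illustrate the failure mode named in the title, a sketch assuming the charset is looked up only within a limited prefix of the page (the `window` parameter is hypothetical; how far HTTrack actually scans is not stated here):

```python
import re

META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)

def detect_charset(html_bytes, window=None):
    """Return the charset declared in a meta tag, scanning at most 'window' bytes."""
    data = html_bytes if window is None else html_bytes[:window]
    match = META_CHARSET.search(data)
    return match.group(1).decode("ascii").lower() if match else None
```

If the meta tag sits beyond the scanned prefix, the function returns None and a caller would fall back to some other charset, which matches the symptom of broken accents.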

Please use labels and text to provide additional information.
Reported by Justme at http://forum.httrack.com/readmsg/30487/index.html

Original issue reported on code.google.com by xroche on 28 Feb 2013 at 2:48

httrack doesn't download RAR files

Using the attached configuration file, httrack didn't download RAR files; 
the links point to external.html instead

(such as 

external.html?link=http://comitespcj.org.br/images/Download/SC_Dados-Ptos-Interesse_22-07-13.rar

external.html?link=http://comitespcj.org.br/images/Download/SC_Vazoes-1930-2012.rar
)

Issues with .ar domain names?

Original issue reported on code.google.com by [email protected] on 8 Nov 2013 at 2:44

Attachments:

Keeping 0 bytes files and error pages even with proper settings set

Even with "No error pages" and "No external pages" selected, and with "Do 
not purge old files" unselected, WinHTTrack 3.47-19 keeps those files without 
any purging attempt.

[project root]/CETESB/sistemasinter.cetesb.sp.gov.br/emergencia/graf_regiao2.html
is 0 bytes and 
http://sistemasinter.cetesb.sp.gov.br/emergencia/graf_regiao2.html
is a 404 error page

[project root]/licenciamento.cetesb.sp.gov.br/legislacao/estadual/decretos/decreto_33499.html
is a 404 error page

[project root]\www.cetesb.sp.gov.br\userfiles\image\mudancasclimaticas\proclima\image\fotos_eventos\seminario_impactos\images\IMG_9589_jpg_jpg.html.readme
was generated for
[project root]/www.cetesb.sp.gov.br/userfiles/image/mudancasclimaticas/proclima/image/fotos_eventos/seminario_impactos/images/IMG_9589_jpg_jpg.html
, a 404 error page. According to the system timestamp, it was on my HTTP /1.0 
run (changed to "force old 1.0" because lots of binary files were renamed as 
.html), but the previous examples were from my current, HTTP 1.0, run

The size on old.XX and new.XXX differs due to changes on my filters settings.

Original issue reported on code.google.com by [email protected] on 25 Jun 2013 at 7:32

Mishandling of long query strings embedding non-ascii characters

What steps will reproduce the problem?
1. Crawl a page with a long query string which include non-ascii characters

What is the expected output? What do you see instead?
The mirror ends abruptly.

Please use labels and text to provide additional information.
http://forum.httrack.com/readmsg/32749/index.html

Original issue reported on code.google.com by xroche on 2 May 2014 at 6:31

assertion failure at htscore.c:244 (len + liensbuf->string_buffer_size < liensbuf->string_buffer_capa)

What steps will reproduce the problem?
n/a

What is the expected behavior ? What do you get instead?
assertion failure at htscore.c:244 (len + liensbuf->string_buffer_size < 
liensbuf->string_buffer_capa)

What version of httrack are you using? On what operating system?
3.48.10

Please provide any additional information below.
http://forum.httrack.com/readmsg/32922/index.html

Original issue reported on code.google.com by xroche on 6 Jun 2014 at 3:47

Pound sign in URL not handled properly?

What steps will reproduce the problem?
1. Add url: "www.tsw-builder.com"
2. Start grab
3. Images are not downloaded

What is the expected behavior ? What do you get instead?

Images should download. Instead, they do not.


What version of httrack are you using? On what operating system?

3.47-27
Windows 7 64-bit


Please provide any additional information below.

Open "www.tsw-builder.com" and click any of the weapon icons in the top left 
corner. A list will appear below. Expand any section by clicking on it. The 
ability icons that appear DO NOT download.

At first I thought it was because the lists do not appear until you select a 
weapon and maybe the images are hidden. So I selected a weapon and copied the 
new URL and input it into httrack. For example: http://www.tsw-builder.com/#15vp

When you do this, an error will appear in the error log:

23:48:37 Error:  "Unable to get server's address: The requested name is valid, 
but no data of the " (-5) after 2 retries at link primary/vp (from 
primary/primary)

Maybe I'm reading it wrong, but it seems like httrack isn't properly handling 
the pound sign (#) in the URL.
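
For context, the part after "#" is a fragment: it is handled by the browser (here, presumably by the page's scripts) and is never sent to the server. A sketch of how a crawler would normally split it off, using Python's standard `urldefrag` (the wrapper name is hypothetical):

```python
from urllib.parse import urldefrag

def request_url(url):
    """Return (URL to request from the server, fragment kept client-side)."""
    base, fragment = urldefrag(url)
    return base, fragment
```

For the URL above this yields "http://www.tsw-builder.com/" as the address to request and "15vp" as the fragment, so the ability icons would only be reachable by running the page's JavaScript.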

Original issue reported on code.google.com by [email protected] on 27 Jul 2013 at 3:50

Image in CSS not parsed

Reported at http://forum.httrack.com/readmsg/30327/30320/index.html

What steps will reproduce the problem?
1. Download glka.co.il/templates/new_default/new_default.css

What is the expected output? What do you see instead?
images/2222.jpg should be detected, and downloaded (but is not)
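
A rough sketch of the detection that is expected here, assuming a regex scan for `url(...)` references (not HTTrack's CSS parser; the function name is hypothetical). Relative references are resolved against the stylesheet's own URL.

```python
import re
from urllib.parse import urljoin

# url(...) with optional matching single or double quotes around the value.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)

def css_image_urls(css_text, css_url):
    """Return absolute URLs for every url(...) reference in a stylesheet."""
    return [
        urljoin(css_url, m.group(2))
        for m in CSS_URL.finditer(css_text)
        if m.group(2)
    ]
```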

What version of the product are you using? On what operating system?
3.46

Original issue reported on code.google.com by xroche on 25 Feb 2013 at 7:41

segOutputSize < segSize assertion fails at htscharset.c:993

What steps will reproduce the problem?
1. mirror http://ut.httrack.com/unicode-links/idna_bogus.html

What is the expected behavior ? What do you get instead?
Expected not to crash.

What version of httrack are you using? On what operating system?
3.48.8

Please provide any additional information below.
http://forum.httrack.com/readmsg/32822/index.html

Original issue reported on code.google.com by xroche on 19 May 2014 at 7:11

feature request - possible issue - add jpeg to default scan rule/options

What steps will reproduce the problem?
1. goto set options, scan rules
2. look at the preset rules (with check boxes) only *.jpg is present in the 
first one

What version of httrack are you using? On what operating system?
httrack 3.47-27 , windows 7

what i would like!

Is it possible to add *.jpeg to this list? Different software uses the two 
different file extensions for JPEG images, and I don't think your software treats 
jpg and jpeg as the same.
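
To show why the extra pattern matters, a small sketch using shell-style wildcards from Python's `fnmatch` module. This only imitates the idea of a scan rule; it is not HTTrack's filter engine, and the URL in the usage note is made up.

```python
from fnmatch import fnmatchcase

def matches_any(url, patterns):
    """True if the lowercased URL matches at least one wildcard pattern."""
    return any(fnmatchcase(url.lower(), p) for p in patterns)
```

A URL such as "http://example.com/photo.jpeg" does not match `*.jpg` alone, but matches once `*.jpeg` is added to the list.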

thanks

jon


Original issue reported on code.google.com by [email protected] on 23 Jan 2014 at 3:26

bus error on Mac OS X

What steps will reproduce the problem?
1. run "webhttrack" on the command line
2. fill out the forms, press "start" on the page where the radio button "Please 
adjust connection parameters if necessary, then press FINISH to launch the 
mirroring operation." is
3. server crashes

What is the expected output? What do you see instead?

On the command line, I see the following output:

$ webhttrack 
/opt/homebrew/bin/webhttrack(58993): launching /usr/bin/open -W
/opt/homebrew/bin/webhttrack(58993): spawning regular browser..
/opt/homebrew/bin/webhttrack: line 166: 59007 Bus error: 10           
${BINPATH}/htsserver "${DISTPATH}/" path "${HOME}/websites" lang "${LANGN}" $@


What version of the product are you using? On what operating system?

HTTrack version 3.47 (.11)

Mac OS X

Darwin vienna.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 
PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64


Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 5 May 2013 at 8:18

Bogus state: file larger than expected

This is almost certainly a server-side issue, but is there any way to get 
those files anyway? Perhaps an advanced option to force saving them?

Running WinHTTrack Website Copier 3.47-27 on Win7 x64

20:05:51 Warning:  file not stored in cache due to bogus state (broken size, 
expected 40960 got 40962): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1
20:05:52 Warning:  file not stored in cache due to bogus state (broken size, 
expected 25088 got 25090): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=2
20:05:52 Warning:  file not stored in cache due to bogus state (broken size, 
expected 27136 got 27138): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=3
20:05:52 Warning:  file not stored in cache due to bogus state (broken size, 
expected 34816 got 34818): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=4
20:05:53 Warning:  file not stored in cache due to bogus state (broken size, 
expected 32768 got 32770): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=5

20:07:45 Warning:  file not stored in cache due to bogus state (broken size, 
expected 333102 got 333104): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1102
20:07:47 Warning:  file not stored in cache due to bogus state (broken size, 
expected 60129 got 60131): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1092
20:08:13 Warning:  file not stored in cache due to bogus state (broken size, 
expected 58789 got 58791): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=892
20:08:18 Warning:  file not stored in cache due to bogus state (broken size, 
expected 72731 got 72733): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=693
20:08:19 Warning:  file not stored in cache due to bogus state (broken size, 
expected 66461 got 66463): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=695
20:14:00 Warning:  file not stored in cache due to bogus state (broken size, 
expected 33930 got 33932): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1237
20:14:00 Warning:  file not stored in cache due to bogus state (broken size, 
expected 27734 got 27736): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1238
20:14:01 Warning:  file not stored in cache due to bogus state (broken size, 
expected 135521 got 135523): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1098
20:14:01 Warning:  file not stored in cache due to bogus state (broken size, 
expected 104016 got 104018): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1100
20:14:02 Warning:  file not stored in cache due to bogus state (broken size, 
expected 113775 got 113777): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1101
20:14:03 Warning:  file not stored in cache due to bogus state (broken size, 
expected 115238 got 115240): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1099
20:15:32 Warning:  file not stored in cache due to bogus state (broken size, 
expected 646567 got 646569): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=883
20:16:05 Warning:  file not stored in cache due to bogus state (broken size, 
expected 105845 got 105847): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=698
20:16:12 Warning:  file not stored in cache due to bogus state (broken size, 
expected 132168 got 132170): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=600
20:16:12 Warning:  file not stored in cache due to bogus state (broken size, 
expected 849282 got 849284): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=601
20:16:15 Warning:  file not stored in cache due to bogus state (broken size, 
expected 294110 got 294112): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=598
20:16:15 Warning:  file not stored in cache due to bogus state (broken size, 
expected 1221725 got 1221727): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=599
20:16:21 Warning:  file not stored in cache due to bogus state (broken size, 
expected 266148 got 266150): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=557
20:16:22 Warning:  file not stored in cache due to bogus state (broken size, 
expected 42635 got 42637): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=558
20:16:22 Warning:  file not stored in cache due to bogus state (broken size, 
expected 94748 got 94750): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=556
20:16:23 Warning:  file not stored in cache due to bogus state (broken size, 
expected 73377 got 73379): 
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=40
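
A quick way to see the pattern in the warnings above is to extract the expected and received sizes and compare them. The following is a minimal sketch, assuming the log format shown in this report (each warning wrapped over three lines); the regular expression and function name are mine, not part of HTTrack. The sample embeds only the first two warnings.

```python
import re

# First two warnings copied from the log in this report.
LOG = """\
20:05:51 Warning:  file not stored in cache due to bogus state (broken size,
expected 40960 got 40962):
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=1
20:05:52 Warning:  file not stored in cache due to bogus state (broken size,
expected 25088 got 25090):
www.comiteps.sp.gov.br/erapido/plugins/erapido.link/download.php?id=2
"""

# \s+ also matches the line breaks introduced by the wrapped log lines.
PATTERN = re.compile(r"broken size,\s+expected (\d+) got (\d+)\):\s+(\S+)")


def size_deltas(log_text):
    """Return (url, received - expected) for every 'broken size' warning."""
    return [
        (url, int(got) - int(expected))
        for expected, got, url in PATTERN.findall(log_text)
    ]


for url, delta in size_deltas(LOG):
    print(f"{delta:+d} bytes  {url}")
```

In the warnings listed in this report, every file is exactly 2 bytes larger than announced. The report does not establish the cause of that constant surplus.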


Original issue reported on code.google.com by [email protected] on 11 Oct 2013 at 11:24

failure to download

Whenever I try to download http://www.computerhope.com/

a dialog box appears saying the following:

**MIRROR ERROR!**
HTTrack has detected that the current mirror is empty. If it was an
update, the previous mirror has been restored.
Reason: the first page(s) either could not be found, or a connection
problem occurred.
=> Ensure that the website still exists, and/or check your proxy settings <=

I am using version 3.47-27 of httrack with the Chrome browser on Windows 8.


Additional information:

The log file says the following:


HTTrack3.47-27+htsswf+htsjava launched on Wed, 05 Feb 2014 11:58:17 at 
http://www.computerhope.com/ +*.png +*.gif +*.jpg +*.css +*.js 
-ad.doubleclick.net/* -mime:application/foobar
(winhttrack -qwC2%Ps2u1%s%uN0%I0p3DaK0H0%kf2A25000%f#f -F "Mozilla/4.5 
(compatible; HTTrack 3.0x; Windows 98)" -%F "<!-- Mirrored from %s%s by HTTrack 
Website Copier/3.x [XR&CO'2013], %s -->" -%l "en, *" 
http://www.computerhope.com/ -O1 "C:\My Web Sites\aboutcomputer" +*.png +*.gif 
+*.jpg +*.css +*.js -ad.doubleclick.net/* -mime:application/foobar )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain sensitive 
information,
 such as username/password authentication for websites mirrored in this project
 do not share these files/folders if you want these information to remain private
11:58:18 Error:  "Forbidden" (403) at link www.computerhope.com/ (from 
primary/primary)
11:58:18 Warning:  No data seems to have been transferred during this session! 
: restoring previous one!


Original issue reported on code.google.com by [email protected] on 5 Feb 2014 at 6:45

Bogus charset on disk when filenames have non-ascii characters

What steps will reproduce the problem?
1. Download a site with non-ascii filenames (such as BIG5 at 
http://fms.cto.doh.gov.tw/DOH/Office/procquerypopularize)

What is the expected output? What do you see instead?
Filenames are expected to be correctly encoded

What version of the product are you using? On what operating system?
3.47-12

Please provide any additional information below.
Reported by Steven Hsiao (http://forum.httrack.com/readmsg/31050/index.html)
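
As an illustration of what "correctly encoded" means here, the sketch below percent-encodes a file name as a Big5 page might, then decodes it with the right and the wrong charset. The file name is hypothetical and the code is not taken from HTTrack; it only shows why the page charset has to be known before the name is written to disk.

```python
from urllib.parse import quote, unquote

# Hypothetical file name, chosen only for illustration.
original = "公告.html"

# How a Big5 site might percent-encode the name in its links.
encoded = quote(original, encoding="big5")

# Decoding with the page charset recovers the intended name.
decoded_ok = unquote(encoded, encoding="big5")

# Decoding with an unrelated charset produces a different, garbled name.
decoded_bad = unquote(encoded, encoding="latin-1")

print(encoded)
print(decoded_ok == original)
print(decoded_bad == original)
```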

Original issue reported on code.google.com by xroche on 18 May 2013 at 3:00

bogus state (broken size, expected NNN got 0)

What steps will reproduce the problem?
1. Download a site providing a "Content-Range: bytes 0-NNN/NNN" header with a 
200 code

What is the expected output? What do you see instead?
The file cannot be downloaded, and the following error is reported:
"bogus state (broken size, expected NNN got 0)"

What version of the product are you using? On what operating system?
3.47-1
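
For reference, here is a minimal sketch of one way a client could treat this header. It assumes, from my reading of the HTTP range semantics, that Content-Range is only meaningful on a 206 (partial content) response and can be ignored on a 200. The function name is hypothetical and this is not HTTrack's actual fix.

```python
import re

# "bytes first-last/total", where total may be "*" when unknown.
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def partial_range(status, headers):
    """Return (first, last, total) for a 206 response, otherwise None."""
    if status != 206:
        # Assumption: on a 200 response the header carries no defined
        # meaning, so the body is treated as the complete file.
        return None
    match = _CONTENT_RANGE.match(headers.get("Content-Range", "").strip())
    if match is None:
        return None
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


print(partial_range(200, {"Content-Range": "bytes 0-99/100"}))
print(partial_range(206, {"Content-Range": "bytes 0-99/100"}))
```

With this reading, the response described in step 1 is handled as an ordinary full download rather than as a resumed transfer.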

Original issue reported on code.google.com by xroche on 14 Apr 2013 at 5:50

Do not rename ".htm" into ".html"

What steps will reproduce the problem?
1. Download ".htm" URL

What is the expected output? What do you see instead?
.htm files are expected, but httrack renamed the files to ".html"

See http://forum.httrack.com/readmsg/31839/index.html

Original issue reported on code.google.com by xroche on 15 Sep 2013 at 11:05

Download fails when site structure is "%h/%p/%N"

SUMMARY
-------

Some sites fail to download when you set the structure type to "%h/%p/%N".

I tested the behavior with ubuntuforums.org and red-gate.com/messageboards.


REPRO FOR UBUNTUFORUMS.ORG
--------------------------

1. Download ubuntuforums.org starting from an arbitrary page. Set site 
structure to "%h/%p/%N" and enable debug logging.

"""
$ httrack 'http://ubuntuforums.org/showthread.php?t=1903782' -N "%h/%p/%N" -Z
"""

httrack exits quickly.


2. Inspect hts-log.txt to find some error messages.

"""
19:54:07    Info:   engine: transfer-status: link error (-1, 'Error when 
decompressing'): ubuntuforums.org/showthread.php?t=1903782
19:54:07    Debug:  File checked by cache: ubuntuforums.org
19:54:07    Info:   engine: warning: serialize error for 
ubuntuforums.org/showthread.php?t=1903782 to /showthreadhtml.tmp: open error 
(directory exists, file does not exist): Permission denied
19:54:07    Info:   engine: warning: serialize error for 
ubuntuforums.org/showthread.php?t=1903782 to /showthreadhtml.tmp: open error 
(directory exists, file does not exist): Permission denied
19:54:07    Info:   engine: warning: serialize error for 
ubuntuforums.org/showthread.php?t=1903782 to /showthreadhtml.tmp: open error 
(directory exists, file does not exist): Permission denied
19:54:07    Debug:  File confirmed (size test): ubuntuforums.org/robots.txt (0)
19:54:07    Info:   engine: warning: serialize error for 
ubuntuforums.org/showthread.php?t=1903782 to /showthreadhtml.tmp: open error 
(directory exists, file does not exist): Permission denied
"""

There is one more error at the end.

"""
19:54:07    Error:  "Error when decompressing" (-1) at link 
ubuntuforums.org/showthread.php?t=1903782 (from primary/primary)
19:54:07    Info:   No data seems to have been transferred during this session! : 
restoring previous one!
19:54:07    Info:   engine: end
19:54:07    Debug:  engine: free
"""

The complete log is attached as hts-log_with_error.txt


EXPECTED BEHAVIOR
-----------------

I expect httrack to download the files and save them in the specified structure:

- a folder called ubuntuforums.org
- a series of folders for the path
- files named showthread.php-1, showthread.php-2, showthread.php-3, etc
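
The layout listed above can be sketched as a small template expansion. This is a simplified illustration, not HTTrack's implementation: it assumes %h is the host name, %p the path without a trailing slash and %N the file name with its extension, and it ignores the query string. The function name is hypothetical.

```python
import posixpath
import re
from urllib.parse import urlsplit


def expand_structure(template, url):
    """Expand %h, %p and %N in template into a relative save path."""
    parts = urlsplit(url)
    directory, name = posixpath.split(parts.path)
    fields = {
        "%h": parts.hostname or "",
        "%p": directory.rstrip("/"),
        "%N": name,
    }
    expanded = re.sub(r"%[hpN]", lambda m: fields[m.group(0)], template)
    # Drop empty segments so an empty %p (or %h) never yields "//" or a
    # leading "/", which would turn the result into an absolute path.
    return "/".join(segment for segment in expanded.split("/") if segment)


print(expand_structure("%h/%p/%N",
                       "http://ubuntuforums.org/showthread.php?t=1903782"))
```

The log above shows the engine trying to write to /showthreadhtml.tmp at the filesystem root. Whether empty template fields are the cause is not confirmed by this report; the sketch only shows why empty segments need care.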


SYSTEM DETAILS
--------------

I am using httrack 3.47-21+libhtsjava.so.2 on Xubuntu 13.04 64-bit.

Original issue reported on code.google.com by [email protected] on 13 Feb 2014 at 8:44

Attachments:

2 downstream bugs

I'm sorry, I think I've filed too many requests today, but I think this is the 
only place where I can get help.

We now have 2 bugs reported in RH bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=923880
https://bugzilla.redhat.com/show_bug.cgi?id=995206

Please help take a look if possible.

Thanks!

Original issue reported on code.google.com by Cickumqt on 13 Sep 2013 at 4:55

Package check for the latest version

I just checked httrack thoroughly and found some issues:

1. httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsglobal.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsbasenet.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsmodules.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsdefines.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/httrack-library.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsbauth.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsconfig.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htswrap.h
httrack-devel.i686: E: incorrect-fsf-address /usr/include/httrack/htsopt.h
httrack.i686: E: incorrect-fsf-address /usr/share/doc/httrack/license.txt

So I hope you can update the license text. The FSF changed its address a long time ago.

2. httrack.i686: E: missing-call-to-setgroups /usr/lib/libhttrack.so.2.0.47
httrack.i686: E: missing-call-to-chdir-with-chroot /usr/lib/libhttrack.so.2.0.47

It seems this pattern goes against the CERT secure coding guidance (POS36-C). 
This executable is calling setuid and setgid without setgroups or initgroups.
There is a high probability this mean it didn't relinquish all groups, and this
would be a potential security issue to be fixed. Seek POS36-C on the web for
details about the problem.

Ref POS36-C:

https://www.securecoding.cert.org/confluence/display/seccode/POS36-C.+Observe+correct+revocation+order+while+relinquishing+privileges


3. httrack.i686: W: spurious-executable-perm /usr/share/doc/httrack/AUTHORS

Seems it should be 644 only.

4. We've found obsolete m4 macros in your package, see:

https://fedorahosted.org/FedoraReview/wiki/AutoTools

5. I found 2 folders:

src/minizip and src/mmsrip. They seem to be bundled libraries? I just took this 
package over in Fedora and I don't know when you added them, but due to policy 
mmsrip is not accepted in Fedora:

https://bugzilla.redhat.com/show_bug.cgi?id=219112

And minizip is Ok. But I'm not sure if I can unbundle it or not.
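
Regarding item 2, the revocation order that POS36-C asks for can be sketched as follows: supplementary groups first, then the group ID, then the user ID. This is an illustration in Python rather than a patch for libhttrack; the function name and the osmod parameter (there only so the call order can be checked without root) are hypothetical.

```python
import os


def drop_privileges(uid, gid, osmod=os):
    """Permanently drop from root to uid/gid in the recommended order."""
    # Clear supplementary groups while the process is still privileged.
    osmod.setgroups([])
    # Change the group ID next; this also needs privileges.
    osmod.setgid(gid)
    # Give up the user ID last, since nothing above works afterwards.
    osmod.setuid(uid)
```

Calling setuid before the other two would leave the process unable to change its groups, which is the situation the rpmlint warning describes.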

Original issue reported on code.google.com by Cickumqt on 13 Sep 2013 at 1:19

webhttrack sometimes starts a navigation leading to "unable to connect to foo:8080"

What steps will reproduce the problem?
1. start webhttrack on a local machine where the hostname is unresolvable

What is the expected behavior ? What do you get instead?
webhttrack should fall back to localhost

What version of httrack are you using? On what operating system?
3.47
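
The expected fallback can be sketched like this. It is a minimal illustration, not the webhttrack launcher code; the function name is hypothetical, and the resolve parameter exists only so the behaviour can be exercised without touching the real resolver.

```python
import socket


def browser_host(hostname, resolve=socket.gethostbyname):
    """Return hostname if it resolves, otherwise fall back to localhost."""
    try:
        resolve(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError.
        return "localhost"
    return hostname


# The URL handed to the browser would then use the returned host name.
print(browser_host(socket.gethostname()))
```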


Original issue reported on code.google.com by xroche on 26 May 2013 at 8:28

filenames containing "+" are not downloaded

Files that contain "+" in the name give a "Not Found" error.
The error occurs with httrack 3.47.20; it does not appear with version 
3.46.1.
Please see the attachment, which contains two simplified and identical projects 
made with the two versions of httrack.
ReadMe.txt in the attachment gives some more explanation.
My operating system is Windows 7.
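
One possible source of this kind of problem is the two meanings of "+" in URLs: it is a literal character in a path, but stands for a space in form-encoded query strings. The report does not confirm that this is the cause; the sketch below, with a hypothetical file name, only shows how the two decoding rules differ.

```python
from urllib.parse import quote, unquote, unquote_plus

name = "a+b.html"  # hypothetical file name

# Path-style decoding keeps the plus sign as it is.
print(unquote(name))

# Form-style decoding turns the plus sign into a space, so a request
# built from the result would ask the server for a different file.
print(unquote_plus(name))

# Percent-encoding the plus sign avoids the ambiguity.
print(quote(name))
```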

Original issue reported on code.google.com by [email protected] on 4 Jul 2013 at 9:05

Attachments:

Debian package: clean target does not completely clean it

What steps will reproduce the problem?
1. Build: debuild -uc -us
2. Clean: debuild clean
3. Build again: debuild -uc -us

What is the expected behavior ? What do you get instead?
The second build is expected to succeed, but it fails with:
dpkg-source: info: local changes detected, the modified files are:
 httrack-3.47.21/Makefile.in
 httrack-3.47.21/aclocal.m4
 httrack-3.47.21/configure
 httrack-3.47.21/html/Makefile.in
 httrack-3.47.21/lang/Makefile.in
 httrack-3.47.21/libtest/Makefile.in
 httrack-3.47.21/m4/Makefile.in
 httrack-3.47.21/man/Makefile.in
 httrack-3.47.21/src/Makefile.in
 httrack-3.47.21/templates/Makefile.in
 httrack-3.47.21/tests/Makefile.in
 httrack-3.47.21/tests/check-network_sh.cache
dpkg-source: error: aborting due to unexpected upstream changes, see 
/tmp/httrack_3.47.21-1ac1.diff.BOw0fx
dpkg-source: info: you can integrate the local changes with dpkg-source --commit
dpkg-buildpackage: error: dpkg-source -b httrack-3.47.21 gave error exit status 2

Manually removing all those autogenerated files is a workaround. Probably there 
is some automake foo that will delete them, other than perhaps the one 
byproduct of tests.

What version of httrack are you using? On what operating system?
Debian source package httrack 3.47.21-1


Original issue reported on code.google.com by [email protected] on 14 Jul 2013 at 6:49
