cld2's Issues

compact_lang_det.h: loadDataFromRawAddress should use types from stdint.h instead of "int"

The existing signature of loadDataFromRawAddress:
void loadDataFromRawAddress(const void* rawAddress, const int length);

The use of "int" here is dangerous because the width of "int" is not fixed 
across platforms, so we don't know what range the length can hold. This is my 
fault, since I'm the one who introduced this API. Before much more time 
elapses, we should use a type from stdint.h instead. In this case I think 
uint32_t would make the most sense, as we need more than 16 bits for sure but 
more than 32 would be truly insane.

It's a simple patch; any objections?
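
A minimal sketch of what the fixed-width signature could look like. Only the function name and the choice of uint32_t come from the issue; the `sketch` namespace, the `g_last_length` variable, and the function body are assumptions made purely for illustration.

```cpp
#include <cstdint>
#include <limits>

namespace sketch {  // hypothetical namespace, not the real CLD2 one

// Records the most recent length so the example has observable behavior.
static uint32_t g_last_length = 0;

// Proposed shape: the length is an explicit 32-bit unsigned value,
// so its range is the same on every platform.
void loadDataFromRawAddress(const void* rawAddress, const uint32_t length) {
  (void)rawAddress;  // a real implementation would parse the tables here
  g_last_length = length;
}

}  // namespace sketch

// The range is known at compile time, unlike with plain int.
static_assert(std::numeric_limits<uint32_t>::max() == 4294967295u,
              "uint32_t always spans exactly 32 bits");
```

With this shape, a length above 2^31 (which would overflow a 32-bit signed int) is still representable.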

Original issue reported on code.google.com by [email protected] on 26 Mar 2014 at 11:56

Post-dynamic-mode cleanup

There are a few things to clean up in r151:
* Use the newly-added constants in the table classes to avoid hardcoding sizes
* Ensure cld2_generated_quadchrome0122_16.cc works with both active tables in 
dynamic mode
* Add the ability to use an already-extant mmap to load the data from (rather 
than managing the mmap directly). This is necessary for systems (such as 
Chromium) where the security model forbids direct access to the filesystem in 
some contexts where CLD2 might be used

Should all be pretty straightforward. Remove all FIXME and TODO comments added 
by [email protected] as well.

Original issue reported on code.google.com by [email protected] on 3 Mar 2014 at 3:25

New GCC 5.0 hits problem with narrowing in list-initializers

The following errors are produced by GCC:

c++ -MMD -MF 
obj/third_party/cld_2/src/internal/cld2_static.cld_generated_cjk_uni_prop_80.o.d
 -DV8_DEPRECATION_WARNINGS -D_FILE_OFFSET_BITS=64 -DCHROMIUM_BUILD 
-DTOOLKIT_VIEWS=1 -DUI_COMPOSITOR_IMAGE_TRANSPORT -DUSE_AURA=1 -DUSE_ASH=1 
-DUSE_PANGO=1 -DUSE_CAIRO=1 -DUSE_DEFAULT_RENDER_THEME=1 -DUSE_LIBJPEG_TURBO=1 
-DUSE_X11=1 -DUSE_CLIPBOARD_AURAX11=1 -DENABLE_ONE_CLICK_SIGNIN 
-DENABLE_PRE_SYNC_BACKUP -DENABLE_REMOTING=1 -DENABLE_WEBRTC=1 
-DENABLE_PEPPER_CDMS -DENABLE_CONFIGURATION_POLICY -DENABLE_NOTIFICATIONS 
-DUSE_UDEV -DDONT_EMBED_BUILD_METADATA -DENABLE_TASK_MANAGER=1 
-DENABLE_EXTENSIONS=1 -DENABLE_PLUGINS=1 -DENABLE_SESSION_SERVICE=1 
-DENABLE_THEMES=1 -DENABLE_AUTOFILL_DIALOG=1 -DENABLE_BACKGROUND=1 
-DENABLE_GOOGLE_NOW=1 -DCLD_VERSION=2 -DENABLE_PRINTING=1 
-DENABLE_BASIC_PRINTING=1 -DENABLE_PRINT_PREVIEW=1 -DENABLE_SPELLCHECK=1 
-DENABLE_CAPTIVE_PORTAL_DETECTION=1 -DENABLE_APP_LIST=1 -DENABLE_SETTINGS_APP=1 
-DENABLE_SUPERVISED_USERS=1 -DENABLE_MDNS=1 -DENABLE_SERVICE_DISCOVERY=1 
-DV8_USE_EXTERNAL_STARTUP_DATA -DUSE_LIBPCI=1 -DUSE_GLIB=1 -DUSE_NSS=1 -DNDEBUG 
-DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -Igen 
-I../../third_party/cld_2/src/internal -I../../third_party/cld_2/src/public 
-fstack-protector --param=ssp-buffer-size=4  -pthread -fno-strict-aliasing 
-Wno-unused-parameter -Wno-missing-field-initializers -fvisibility=hidden -pipe 
-fPIC 
-B/home/marxin/Programming/chromium/src/third_party/binutils/Linux_x64/Release/b
in -Wno-unused-local-typedefs -Wno-format -Wno-unused-result -m64 -march=x86-64 
-O2 -fno-ident -fdata-sections -ffunction-sections -funwind-tables 
-fno-exceptions -fno-rtti -fno-threadsafe-statics -fvisibility-inlines-hidden 
-Wno-deprecated -std=gnu++11 -Wno-narrowing -Wno-literal-suffix  -c 
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc -o 
obj/third_party/cld_2/src/internal/cld2_static.cld_generated_cjk_uni_prop_80.o 
-Wno-narrowing
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
 };
 ^
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: 
error: narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka 
unsigned char}’ inside { }
... (and many more)

The problem is discussed further in the following thread: 
https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/D5YxoMmtEmE
I think the fix is quite obvious: the generator should produce only uint8 values.
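
A minimal sketch of the kind of change the generator could make, assuming the intended byte for an entry such as -14 is its value modulo 256. The names `kTable` and `ToByte` are hypothetical and do not come from the CLD2 sources.

```cpp
#include <cstdint>

typedef uint8_t uint8;  // stand-in for CLD2::uint8 (unsigned char)

// What the compiler rejects under C++11 list-initialization rules:
//   static const uint8 kBad[] = { -14 };   // narrowing: int -14 -> uint8
// Emitting the equivalent unsigned byte value avoids the conversion.
static const uint8 kTable[] = { 242 };  // 242 == 256 - 14 == 0xF2

// Helper a generator could apply to each entry before printing it.
// Conversion to an unsigned type is defined as reduction modulo 256.
constexpr uint8 ToByte(int v) { return static_cast<uint8>(v); }
```

The table contents stay bit-for-bit identical; only the spelling of the literals in the generated file changes.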

Thanks,
Martin

Original issue reported on code.google.com by [email protected] on 5 Jan 2015 at 10:34

Windows compile process for Chromium unhappy with zero-length array declarations

In these files (and any others, obviously):
cld2_generated_distinctoctachrome0122
cld2_generated_deltaoctachrome0122

The Windows compile chain for Chromium is upset because there is an attempt to 
declare a zero-length array. Dick has noted this as a concern when we let the 
size be zero, and it seems the concern is valid under the Chromium build chain 
on Windows.

From Chromium's buildbots, here are the error messages from compilation:

FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma\gomacc.exe 
"E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe" /nologo 
/showIncludes /FC 
@obj\third_party\cld_2\src\internal\cld_2.cld2_generated_distinctoctachrome0122.
obj.rsp /c 
..\..\third_party\cld_2\src\internal\cld2_generated_distinctoctachrome0122.cc 
/Foobj\third_party\cld_2\src\internal\cld_2.cld2_generated_distinctoctachrome012
2.obj /Fdobj\third_party\cld_2\cld_2.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_dis
tinctoctachrome0122.cc(2184) : error C2466: cannot allocate an array of 
constant size 0
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_dis
tinctoctachrome0122.cc(2186) : error C2466: cannot allocate an array of 
constant size 0
FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma\gomacc.exe 
"E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe" /nologo 
/showIncludes /FC 
@obj\third_party\cld_2\src\internal\cld_2.cld2_generated_deltaoctachrome0122.obj
.rsp /c 
..\..\third_party\cld_2\src\internal\cld2_generated_deltaoctachrome0122.cc 
/Foobj\third_party\cld_2\src\internal\cld_2.cld2_generated_deltaoctachrome0122.o
bj /Fdobj\third_party\cld_2\cld_2.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_del
taoctachrome0122.cc(4577) : error C2466: cannot allocate an array of constant 
size 0
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_del
taoctachrome0122.cc(4579) : error C2466: cannot allocate an array of constant 
size 0
ninja: build stopped: subcommand failed.

The workaround we had in place before was to have the constants for size *say* 
zero, i.e. the code will never read anything from the array and the dynamic 
data tool will just skip it. We'd then actually allocate an array of size one 
(however many bytes, usually 4 for our use cases of uint32). This makes the 
compiler happy at a cost of a few bytes of overhead in non-dynamic mode. Seems 
like we don't really have a choice here, so I'll prepare the patch.
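
The workaround described above can be sketched as follows. The names `kDummyTableSize`, `kDummyTable`, and `SumEntries` are hypothetical; the real generated files use their own identifiers.

```cpp
#include <cstdint>

// The size constant reports zero entries, so readers never index the array.
static const uint32_t kDummyTableSize = 0;

// The array itself holds one placeholder element, because C++ does not
// allow an array of constant size 0 (MSVC reports this as error C2466).
static const uint32_t kDummyTable[1] = { 0 };

// Sums only the first `size` entries; with size 0 the placeholder is skipped.
uint32_t SumEntries(const uint32_t* table, uint32_t size) {
  uint32_t total = 0;
  for (uint32_t i = 0; i < size; ++i) total += table[i];
  return total;
}
```

The cost is one unused element (4 bytes for uint32) per empty table.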

Original issue reported on code.google.com by [email protected] on 12 Mar 2014 at 11:32

Valgrind errors?

Hi, thanks for the awesome library. I'm seeing a couple memory errors in 
valgrind when I use it.

The first:

==7805== Conditional jump or move depends on uninitialised value(s)
==7805==    at 0x4C2CB94: strcmp (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7805==    by 0x43C412: CLD2::DoTLDLookup(char const*, CLD2::TLDLookup const*, 
int) (compact_lang_det_hint_code.cc:1034)
==7805==    by 0x43D705: CLD2::SetCLDTLDHint(char const*, CLD2::CLDLangPriors*) 
(compact_lang_det_hint_code.cc:1452)
==7805==    by 0x40CEB0: CLD2::ApplyHints(char const*, int, bool, 
CLD2::CLDHints const*, CLD2::ScoringContext*) (compact_lang_det_impl.cc:1504)
==7805==    by 0x40DC4F: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1644)
==7805==    by 0x409BAE: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==7805==    by 0x405932: codulus::main(int, char**) 
(test_language_detection.cc:43)
==7805==    by 0x406341: main (test_language_detection.cc:64)
==7805== 

This one seems reasonable to me: DoTLDLookup is using strcmp, but the value of 
'key' passed to it is not null-terminated.
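
If that diagnosis is right, one way to make the comparison safe is to terminate the key before calling strcmp. This is only a sketch: `CompareKey` is a hypothetical helper, and the actual lookup code in compact_lang_det_hint_code.cc may be structured differently.

```cpp
#include <cstring>
#include <string>

// Compares the first `key_len` bytes of `key` with a null-terminated table
// entry without ever reading past key + key_len.
int CompareKey(const char* key, int key_len, const char* entry) {
  // std::string copies exactly key_len bytes and adds its own terminator.
  const std::string terminated(
      key, static_cast<std::string::size_type>(key_len));
  return std::strcmp(terminated.c_str(), entry);
}
```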


The other issue I see is an invalid read of one character past the end of my 
input in a couple places in the code:

==8337== Invalid read of size 1
==8337==    at 0x415932: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:973)
==8337==    by 0x415DAE: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) 
(test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

==8337== Invalid read of size 1
==8337==    at 0x414D3C: CLD2::UTF8OneCharLen(char const*) 
(utf8statetable.h:270)
==8337==    by 0x415A6D: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:991)
==8337==    by 0x415DAE: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) 
(test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

==8337== Invalid read of size 1
==8337==    at 0x41D1A3: 
CLD2::UTF8GenericPropertyTwoByte(CLD2::UTF8StateMachineObj_2 const*, unsigned 
char const**, int*) (utf8statetable.cc:403)
==8337==    by 0x414D24: CLD2::GetUTF8LetterScriptNum(char const*) 
(getonescriptspan.cc:1098)
==8337==    by 0x415A87: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:992)
==8337==    by 0x415DAE: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) 
(test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

For now, I'm working around this by passing (input, size - 1) instead of 
(input, size) to cld2. My input is not null terminated, if that makes a 
difference. It seems to happen with every input I try (they are all web pages, 
by the way). Also, I am running this on x64 linux.
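
An alternative to passing size - 1 might be to hand the library a copy of the input that has one readable byte after its end, so that no input byte is dropped. The sketch below assumes the overread is limited to the single byte Valgrind reports; `MakePaddedCopy` is a hypothetical helper, not part of CLD2.

```cpp
#include <string>

// Returns a copy of the input whose storage is followed by a '\0' byte.
// Since C++11, std::string guarantees data()[size()] is a readable '\0'.
std::string MakePaddedCopy(const char* input, int size) {
  return std::string(input, static_cast<std::string::size_type>(size));
}
```

The detector would then be called with the copy's data() and size(). This costs one extra copy of the page and does not fix the read inside the library itself.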

Any ideas?

Original issue reported on code.google.com by [email protected] on 21 Aug 2013 at 6:04

compile_libs.sh does not work on Windows 7 x64 with cygwin

What steps will reproduce the problem?
1. sh compile_libs.sh
2. Observe errors

What is the expected output? What do you see instead?
Expected: None (success)
Actual output:

compact_lang_det_test.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
compact_lang_det_test.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
cldutil.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil_shared.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
cldutil_shared.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
compact_lang_det.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det_hint_code.cc:1:0: warning: -fPIC ignored for target (all code 
is position independent) [enabled by default]
compact_lang_det_hint_code.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
compact_lang_det_impl.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
compact_lang_det_impl.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
debug.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
debug.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
fixunicodevalue.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
fixunicodevalue.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_entities.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_entities.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_language.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_language.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_ulscript.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_ulscript.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
getonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
getonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
lang_script.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
lang_script.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
offsetmap.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
offsetmap.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
scoreonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
scoreonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
tote.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
tote.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
utf8statetable.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
utf8statetable.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld_generated_cjk_uni_prop_80.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld_generated_cjk_uni_prop_80.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld2_generated_cjk_compatible.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld2_generated_cjk_compatible.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld_generated_cjk_delta_bi_4.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld_generated_cjk_delta_bi_4.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
generated_distinct_bi_0.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_distinct_bi_0.cc:1:0: sorry, unimplemented: 64-bit mode not compiled 
in
cld2_generated_quadchrome0715.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld2_generated_quadchrome0715.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld2_generated_deltaoctachrome0614.cc:1:0: warning: -fPIC ignored for target 
(all code is position independent) [enabled by default]
cld2_generated_deltaoctachrome0614.cc:1:0: sorry, unimplemented: 64-bit mode 
not compiled in
cld2_generated_distinctoctachrome0604.cc:1:0: warning: -fPIC ignored for target 
(all code is position independent) [enabled by default]
cld2_generated_distinctoctachrome0604.cc:1:0: sorry, unimplemented: 64-bit mode 
not compiled in
cld_generated_score_quad_octa_1024_256.cc:1:0: warning: -fPIC ignored for 
target (all code is position independent) [enabled by default]
cld_generated_score_quad_octa_1024_256.cc:1:0: sorry, unimplemented: 64-bit 
mode not compiled in
compact_lang_det_test.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
compact_lang_det_test.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
cldutil.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil_shared.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
cldutil_shared.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
compact_lang_det.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det_hint_code.cc:1:0: warning: -fPIC ignored for target (all code 
is position independent) [enabled by default]
compact_lang_det_hint_code.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
compact_lang_det_impl.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
compact_lang_det_impl.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
debug.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
debug.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
fixunicodevalue.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
fixunicodevalue.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_entities.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_entities.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_language.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_language.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_ulscript.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_ulscript.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
getonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
getonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
lang_script.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
lang_script.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
offsetmap.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
offsetmap.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
scoreonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
scoreonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
tote.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
tote.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
utf8statetable.cc:1:0: warning: -fPIC ignored for target (all code is position 
independent) [enabled by default]
utf8statetable.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld_generated_cjk_uni_prop_80.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld_generated_cjk_uni_prop_80.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld2_generated_cjk_compatible.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld2_generated_cjk_compatible.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld_generated_cjk_delta_bi_32.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld_generated_cjk_delta_bi_32.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
generated_distinct_bi_0.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
generated_distinct_bi_0.cc:1:0: sorry, unimplemented: 64-bit mode not compiled 
in
cld2_generated_quad0720.cc:1:0: warning: -fPIC ignored for target (all code is 
position independent) [enabled by default]
cld2_generated_quad0720.cc:1:0: sorry, unimplemented: 64-bit mode not compiled 
in
cld2_generated_deltaocta0527.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld2_generated_deltaocta0527.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld2_generated_distinctocta0527.cc:1:0: warning: -fPIC ignored for target (all 
code is position independent) [enabled by default]
cld2_generated_distinctocta0527.cc:1:0: sorry, unimplemented: 64-bit mode not 
compiled in
cld_generated_score_quad_octa_1024_256.cc:1:0: warning: -fPIC ignored for 
target (all code is position independent) [enabled by default]
cld_generated_score_quad_octa_1024_256.cc:1:0: sorry, unimplemented: 64-bit 
mode not compiled in

What version of the product are you using? On what operating system?
Windows 7 x64 SP1
gcc version 4.7.3 (GCC) (i686-pc-cygwin)
GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)

Please provide any additional information below.

I'm trying to build this library on a Windows host to be used in the 
chromium-compact-language-detector Python extension.

When I remove the flags -fPIC and -m64 the compilation works (but that is 
probably not the right fix). And I can't test it because the Python extension 
requires *.lib files but *.so are produced.

Original issue reported on code.google.com by [email protected] on 9 Sep 2013 at 2:22

CLD2 result chunk vector omits portions of input file

Hello,

I'm trying to extract natural language from a web crawl for use in NLP 
applications. Since web pages often have multiple languages on them, I'm using 
CLD2's ResultChunkVector API to split each page into chunks of known uniform 
language. The problem I'm running into is that fairly often, the 
ResultChunkVector simply doesn't include parts of the input text -- I've 
attached two sample files that demonstrate this. In 32200.utf8, the first chunk 
starts at position 8 -- I guess this has something to do with the fact that the 
file starts with numbers/punctuation? In 27878255.utf8, the first chunk covers 
positions 0-65530, and the second chunk begins at position 199884 (so there's a 
very substantial amount of text being skipped! and the text appears to be plain 
old English, nothing special) -- I guess this might have something to do with 
the use of a 2-byte length field, but the length of the first chunk isn't 
2**16. And perhaps there are other cases that also lead to gaps like this.

My expectation was that the first chunk would always start at position 0, that 
each chunk would start where the previous one ended, and that the last chunk 
would end at the end of the input file. Or, if this isn't possible, then is 
there any guidance on how gaps like this should be interpreted? I could simply 
pretend they were tagged "unknown", but this seems like a pretty weird way to 
handle the 140 kB of English text in 27878255.utf8.

I'm using the "full" detector, but these files trigger the behaviour in both 
full and regular modes (slightly differently).
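
Treating the gaps as "unknown" can be sketched as below. The `Chunk` struct is a hypothetical stand-in for a result chunk (a byte range plus a language id), and `kUnknownLang` and `FillGaps` are names invented for this example. The sketch assumes the chunks arrive sorted by offset and do not overlap.

```cpp
#include <vector>

// Hypothetical stand-in for a result chunk: a byte range plus a language id.
struct Chunk {
  int offset;  // start of the chunk in the input, in bytes
  int bytes;   // length of the chunk in bytes
  int lang;    // language id; the values used here are arbitrary
};

const int kUnknownLang = -1;  // hypothetical marker for uncovered ranges

// Returns the chunks in order, with a kUnknownLang chunk inserted for every
// byte range of [0, input_len) that no chunk covers.
std::vector<Chunk> FillGaps(const std::vector<Chunk>& chunks, int input_len) {
  std::vector<Chunk> out;
  int pos = 0;  // first byte not yet covered by the output
  for (const Chunk& c : chunks) {
    if (c.offset > pos) out.push_back({pos, c.offset - pos, kUnknownLang});
    out.push_back(c);
    pos = c.offset + c.bytes;
  }
  if (pos < input_len) out.push_back({pos, input_len - pos, kUnknownLang});
  return out;
}
```

With the offsets quoted for 27878255.utf8 (a first chunk covering 0-65530 and a second starting at 199884), this yields one synthetic chunk of 134354 bytes between them.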

Original issue reported on code.google.com by [email protected] on 1 Jul 2014 at 11:04


Missing include in cld2_dynamic_data_loader.cc

What steps will reproduce the problem?
1. Try to compile the cld_2_dynamic_data_tool with gcc 4.8 in Ubuntu 12.04

What is the expected output? What do you see instead?

It fails because close() isn't defined. close() is declared in <unistd.h> and 
adding that include makes it compile.

I could do a patch, but I suspect it would be much faster for everyone if a 
maintainer just did this manually:

index 7227b8e..06375e18 100644
--- a/third_party/cld_2/src/internal/cld2_dynamic_data_loader.cc
+++ b/third_party/cld_2/src/internal/cld2_dynamic_data_loader.cc
@@ -19,6 +19,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
+#include <unistd.h>

 #include "cld2_dynamic_data.h"
 #include "cld2_dynamic_data_loader.h"


Original issue reported on code.google.com by [email protected] on 29 Aug 2014 at 8:22

  • Merged into: #19

please provide a SONAME

Can you please provide a SONAME for the library?

Installing something in /usr/lib without a SONAME is so painful.

Original issue reported on code.google.com by [email protected] on 10 Feb 2015 at 3:37

Build warning on Windows with clang

http://build.chromium.org/p/chromium.fyi/builders/Cr%20Win%20Clang/builds/108/steps/compile/logs/stdio

..\..\third_party\cld_2\src\internal\offsetmap.cc(82,43) :  warning(clang): 
format specifies type 'long' but the argument has type 'size_type' (aka 
'unsigned int') [-Wformat]
  fprintf(fout, "Offsetmap: %ld bytes\n", diffs_.size());
                            ~~~           ^~~~~~~~~~~~~
                            %u

There's no great portable way to printf size_t types. Since this is debugging 
code, I suggest this patch:

Nicos-MacBook-Pro:src thakis$ svn diff
Index: internal/offsetmap.cc
===================================================================
--- internal/offsetmap.cc   (revision 165)
+++ internal/offsetmap.cc   (working copy)
@@ -79,7 +79,8 @@
   }

   Flush();    // Make sure any pending entry gets printed
-  fprintf(fout, "Offsetmap: %ld bytes\n", diffs_.size());
+  fprintf(fout, "Offsetmap: %lu bytes\n",
+          static_cast<unsigned long>(diffs_.size()));
   for (int i = 0; i < static_cast<int>(diffs_.size()); ++i) {
     fprintf(fout, "%c%02d ", "&=+-"[OpPart(diffs_[i])], LenPart(diffs_[i]));
     if ((i % 20) == 19) {fprintf(fout, "\n");}

Can you land this, please?

Original issue reported on code.google.com by [email protected] on 18 Aug 2014 at 2:24

Wasted work in cld::GetNormalizedScore() and cld::GetReliability()

The problem appears in revision 215539. I have attached a simple one-line patch 
that fixes it.

In method cld::GetNormalizedScore() in cld/compact_lang_det/cldutil.cc, the 
loop on line 818 keeps overwriting "expected_score" with "kMeanScore[cur_lang * 
4 + i]" whenever that value is larger than zero. Therefore, only the last 
written value is visible after the loop, and all the other writes and 
iterations are unnecessary. The patch iterates "i" from the end and breaks the 
first time "expected_score" is set.

A similar problem also appears in cld::GetReliability(), at line 846.
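
The two loop shapes can be compared in a small sketch. The table `kScores` and both function names are hypothetical, and the sketch assumes the score starts at 0 before the loop, which the report does not state.

```cpp
// Hypothetical table: four score slots per language, 0 meaning "no value".
static const int kScores[2 * 4] = {
  10, 20, 0, 0,   // language 0: last positive slot holds 20
  5,  0,  7, 0,   // language 1: last positive slot holds 7
};

// Forward scan as described in the report: later positive slots overwrite
// earlier ones, so only the last positive value survives.
int LastPositiveForward(int lang) {
  int expected = 0;
  for (int i = 0; i < 4; ++i) {
    if (kScores[lang * 4 + i] > 0) expected = kScores[lang * 4 + i];
  }
  return expected;
}

// Backward scan: same result, but stops at the first positive slot found.
int LastPositiveBackward(int lang) {
  for (int i = 3; i >= 0; --i) {
    if (kScores[lang * 4 + i] > 0) return kScores[lang * 4 + i];
  }
  return 0;
}
```

Both functions return the same value for every row; the backward version simply does less work when the last positive slot is near the end.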


Original issue reported on code.google.com by [email protected] on 6 Aug 2013 at 8:45

Consider declaring dynamic data methods unconditionally

Today, we guard the declaration of the dynamic-data-related functions in 
compact_lang_det.h with "#ifdef CLD2_DYNAMIC_MODE":
https://code.google.com/p/cld2/source/browse/trunk/public/compact_lang_det.h

This causes some unfortunate side effects when including CLD2 in another 
project: unless building with a single compile pass including all sources, any 
separate compilation unit that requires dynamic functionality has to have the 
same define when it #includes compact_lang_det.h in order to keep the compiler 
happy.

For example, Chromium builds CLD2 separately, then links it into the Chromium 
binary; but if CLD2_DYNAMIC_MODE isn't defined in Chromium code that includes 
compact_lang_det.h, you get compiler errors like the ones below even if CLD2 
itself has been built with the define:

error: 'isDataLoaded' is not a member of 'CLD2'
error: 'loadDataFromRawAddress' is not a member of 'CLD2'

Ideally, the #define guard can be encapsulated entirely within CLD2 so that the 
dependent library doesn't need to know about this at all.

The downside is that dependent code might accidentally try to use dynamic mode 
even if it isn't available. Throwing exceptions isn't a viable solution, since 
some projects disable exceptions when compiling. We'd presumably just have to 
define the following behavior if CLD2_DYNAMIC_MODE is not defined:

isDataLoaded: return true
loadDataFromRawAddress: no-op and output a warning to stderr
loadDataFromFile: no-op and output a warning to stderr

This change should be fully backwards compatible, since it doesn't change or 
remove any existing function declarations under any circumstances.
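
A minimal sketch of the proposed arrangement, assuming it is compiled without CLD2_DYNAMIC_MODE defined. The `sketch` namespace and the warning text are assumptions; the function names, the `int` length parameter, and the fallback behavior follow the description above.

```cpp
#include <cstdio>

namespace sketch {  // hypothetical namespace, not the real CLD2 one

// Declarations are always visible, regardless of CLD2_DYNAMIC_MODE.
bool isDataLoaded();
void loadDataFromRawAddress(const void* rawAddress, const int length);

#ifndef CLD2_DYNAMIC_MODE
// Fallback behavior when dynamic mode is compiled out.
bool isDataLoaded() { return true; }  // static tables are always present

void loadDataFromRawAddress(const void*, const int) {
  // No-op apart from the warning, as proposed in the issue.
  std::fprintf(stderr, "warning: dynamic mode is not enabled; ignoring\n");
}
#endif

}  // namespace sketch
```

Code that includes the header then compiles the same way whether or not it defines CLD2_DYNAMIC_MODE itself; the guard only selects which definitions get built inside the library.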

Original issue reported on code.google.com by [email protected] on 23 Jun 2014 at 9:57

Compilation issues in Visual Studio

I am trying to compile the chromium in Visual Studio 2013. I am actually trying 
to create a .NET Wrapper for the library so I have added all the source files 
inside my CLR project.

Now whenever I compile I get these linking errors.

    error LNK2005: "struct CLD2::CLD2TableSummary const CLD2::kCjkDeltaBi_obj" (?kCjkDeltaBi_obj@CLD2@@3UCLD2TableSummary@1@B) already defined in cld_generated_cjk_delta_bi_32.obj

These all seem to be related, as I can see a relation between the 'generated' 
files.

The problem is that I have a lot of these, and I am not sure which ones I 
should exclude and which I should keep and use in my code.

Here is a list of all the generated files that came with the CLD2 code.

    cld_generated_cjk_uni_prop_80.cc
    cld_generated_score_quad_octa_2.cc
    cld_generated_score_quad_octa_0122.cc
    cld_generated_score_quad_octa_0122_2.cc
    cld_generated_score_quad_octa_1024_256.cc
    cld_generated_cjk_delta_bi_4.cc
    cld_generated_cjk_delta_bi_32.cc
    cld2_generated_octa2_dummy.cc
    cld2_generated_quad0122.cc
    cld2_generated_quad0720.cc
    cld2_generated_quadchrome_2.cc
    cld2_generated_quadchrome_16.cc
    cld2_generated_cjk_compatible.cc
    cld2_generated_deltaocta0122.cc
    cld2_generated_deltaocta0527.cc
    cld2_generated_deltaoctachrome.cc
    cld2_generated_distinctocta0122.cc
    cld2_generated_distinctocta0527.cc
    cld2_generated_distinctoctachrome.cc

The naming convention suggests that I should only be using one file from each 
group. At least that is how I think it should be used; I am not an expert in 
encodings or in how CLD2 works, and I could not find any references online 
explaining how to configure it.

I tried eliminating the linking errors by keeping only one of each generated 
group:

for example: from `cld_generated_cjk_delta_bi_4` and 
`cld_generated_cjk_delta_bi_32` I kept the 32 version. And so on for the rest 
of the files.

This made CLD2 compile, yet when I tested it with several languages the scores 
were far off and detection was inexplicably bad.

I am not trying to support all languages; I only need to support Latin 
languages along with Hebrew, Arabic, Japanese and Chinese.

Can someone please explain how to configure CLD2 to compile and work correctly?


Original issue reported on code.google.com by [email protected] on 30 Mar 2015 at 5:57

Add armv8-a support

On behalf of [email protected] (cc'd):

--- snip ---
I would like to send a patch to CLD_2 for adding ARMv8a to the supporting list 
in internal/port.h.
It's my first time to send patches to CLD_2 project, and I have no idea how 
to upload it.

Can you take a look at the attached file to check if this modification is 
useful? Is there any issue related to this modification? Also, can you tell me 
how to upload the patch properly?

Thanks for your kindly help.
--- snip ---

Original issue reported on code.google.com by [email protected] on 5 May 2015 at 9:46


CLD should check result of "new" in all use cases

There are many uses of the "new" operator in the CLD source code, such as in 
scoreonescriptspan.cc's "new ScoringHitBuffer":

https://code.google.com/p/cld2/source/browse/trunk/internal/scoreonescriptspan.cc#1168

There's no check that the "new" operator successfully allocated memory. In 
low-memory conditions this can lead to an access violation and subsequent crash.

The code should fail gracefully under low-memory conditions, though it isn't 
immediately obvious how to "gracefully" fail or how helpful it would be to the 
caller to have such behavior if they are truly out of memory.
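
One conventional option is the standard `new (std::nothrow)` form, which returns a null pointer on failure instead of throwing. The sketch below only illustrates the check: `HitBufferSketch` and `ScoreSpanSketch` are hypothetical stand-ins, not the real `ScoringHitBuffer` or any CLD2 function, and what the caller should do on failure remains the open question raised above.

```cpp
#include <new>  // std::nothrow

// Hypothetical stand-in for ScoringHitBuffer; the real struct differs.
struct HitBufferSketch {
  int next_hit;
  int hits[1000];
};

// Returns false, leaving *hit_count untouched, if the buffer cannot be
// allocated. Returns true and stores the number of recorded hits otherwise.
bool ScoreSpanSketch(int* hit_count) {
  // Value-initialization zeroes the buffer; nothrow yields nullptr on failure.
  HitBufferSketch* buffer = new (std::nothrow) HitBufferSketch();
  if (buffer == nullptr) {
    return false;  // the caller decides how to degrade
  }
  buffer->hits[buffer->next_hit++] = 1;  // placeholder for the scoring work
  *hit_count = buffer->next_hit;
  delete buffer;
  return true;
}
```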

Original issue reported on code.google.com by [email protected] on 7 Jan 2015 at 12:19

please use CFLAGS CXXFLAGS CPPFLAGS and LDFLAGS

patch attached.

Description: Adding CFLAGS CXXFLAGS CPPFLAGS and LDFLAGS to the build
Author: Gianfranco Costamagna <[email protected]>
Origin: debian
Last-Update: <2015-01-10>

--- cld2-0.0.0~svn193.orig/internal/compile.sh
+++ cld2-0.0.0~svn193/internal/compile.sh
@@ -14,7 +14,7 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.

-g++ -O2 -m64  compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS compact_lang_det_test.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -24,10 +24,10 @@ g++ -O2 -m64  compact_lang_det_test.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o compact_lang_det_test_chrome_2
+  -o compact_lang_det_test_chrome_2 $LDFLAGS
 echo "  compact_lang_det_test_chrome_2 compiled"

-g++ -O2 -m64  compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS compact_lang_det_test.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -37,11 +37,11 @@ g++ -O2 -m64  compact_lang_det_test.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_16.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o compact_lang_det_test_chrome_16
+  -o compact_lang_det_test_chrome_16 $LDFLAGS
 echo "  compact_lang_det_test_chrome_16 compiled"


-g++ -O2 -m64  cld2_unittest.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS cld2_unittest.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -51,10 +51,10 @@ g++ -O2 -m64  cld2_unittest.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o cld2_unittest_chrome_2
+  -o cld2_unittest_chrome_2 $LDFLAGS
 echo "  cld2_unittest_chrome_2 compiled"

-g++ -O2 -m64  -Davoid_utf8_string_constants cld2_unittest.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -Davoid_utf8_string_constants cld2_unittest.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -64,7 +64,7 @@ g++ -O2 -m64  -Davoid_utf8_string_consta
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o cld2_unittest_avoid_chrome_2
+  -o cld2_unittest_avoid_chrome_2 $LDFLAGS
 echo "  cld2_unittest_avoid_chrome_2 compiled"


--- cld2-0.0.0~svn193.orig/internal/compile_dynamic.sh
+++ cld2-0.0.0~svn193/internal/compile_dynamic.sh
@@ -15,7 +15,7 @@
 #  limitations under the License.

 # The data tool, which can be used to read and write CLD2 dynamic data files
-g++ -O2 -m64 cld2_dynamic_data_tool.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS cld2_dynamic_data_tool.cc \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_extractor.h cld2_dynamic_data_extractor.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
@@ -28,11 +28,11 @@ g++ -O2 -m64 cld2_dynamic_data_tool.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o cld2_dynamic_data_tool
+  -o cld2_dynamic_data_tool $LDFLAGS
 echo "  cld2_dynamic_data_tool compiled"

 # Tests for Chromium flavored dynamic CLD2
-g++ -O2 -m64 -D CLD2_DYNAMIC_MODE compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -D CLD2_DYNAMIC_MODE compact_lang_det_test.cc \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_extractor.h cld2_dynamic_data_extractor.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
@@ -41,12 +41,12 @@ g++ -O2 -m64 -D CLD2_DYNAMIC_MODE compac
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
   getonescriptspan.cc lang_script.cc offsetmap.cc  scoreonescriptspan.cc \
   tote.cc utf8statetable.cc  \
-  -o compact_lang_det_dynamic_test_chrome
+  -o compact_lang_det_dynamic_test_chrome $LDFLAGS
 echo "  compact_lang_det_dynamic_test_chrome compiled"


 # Unit tests, in dynamic mode
-g++ -O2 -m64 -g3 -D CLD2_DYNAMIC_MODE cld2_unittest.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -g3 -D CLD2_DYNAMIC_MODE cld2_unittest.cc \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
@@ -54,11 +54,11 @@ g++ -O2 -m64 -g3 -D CLD2_DYNAMIC_MODE cl
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
   getonescriptspan.cc lang_script.cc offsetmap.cc  scoreonescriptspan.cc \
   tote.cc utf8statetable.cc  \
-  -o cld2_dynamic_unittest
+  -o cld2_dynamic_unittest $LDFLAGS
 echo "  cld2_dynamic_unittest compiled"

 # Shared library, in dynamic mode
-g++ -shared -fPIC -O2 -m64 -D CLD2_DYNAMIC_MODE \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -shared -fPIC -D CLD2_DYNAMIC_MODE \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc \
@@ -66,6 +66,6 @@ g++ -shared -fPIC -O2 -m64 -D CLD2_DYNAM
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
   getonescriptspan.cc lang_script.cc offsetmap.cc  scoreonescriptspan.cc \
   tote.cc utf8statetable.cc  \
-  -o libcld2_dynamic.so
+  -o libcld2_dynamic.so $LDFLAGS
 echo "  libcld2_dynamic.so compiled"

--- cld2-0.0.0~svn193.orig/internal/compile_full.sh
+++ cld2-0.0.0~svn193/internal/compile_full.sh
@@ -14,7 +14,7 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.

-g++ -O2 -m64  compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS compact_lang_det_test.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -24,10 +24,10 @@ g++ -O2 -m64  compact_lang_det_test.cc \
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o compact_lang_det_test_full
+  -o compact_lang_det_test_full $LDFLAGS
 echo "  compact_lang_det_test_full compiled"

-g++ -O2 -m64  cld2_unittest_full.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS cld2_unittest_full.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -37,10 +37,10 @@ g++ -O2 -m64  cld2_unittest_full.cc \
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o cld2_unittest_full
+  -o cld2_unittest_full $LDFLAGS
 echo "  cld2_unittest_full compiled"

-g++ -O2 -m64  -Davoid_utf8_string_constants cld2_unittest_full.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -Davoid_utf8_string_constants cld2_unittest_full.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -50,6 +50,6 @@ g++ -O2 -m64  -Davoid_utf8_string_consta
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o cld2_unittest_full_avoid
+  -o cld2_unittest_full_avoid $LDFLAGS
 echo "  cld2_unittest_full_avoid compiled"

--- cld2-0.0.0~svn193.orig/internal/compile_libs.sh
+++ cld2-0.0.0~svn193/internal/compile_libs.sh
@@ -14,7 +14,7 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.

-g++ -shared -fPIC -O2 -m64 \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -shared -fPIC \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -24,9 +24,9 @@ g++ -shared -fPIC -O2 -m64 \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o libcld2.so
+  -o libcld2.so $LDFLAGS

-g++ -shared -fPIC -O2 -m64 \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -shared -fPIC \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -36,5 +36,5 @@ g++ -shared -fPIC -O2 -m64 \
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o libcld2_full.so
+  -o libcld2_full.so $LDFLAGS



(there is an ongoing debian effort to package it)

Original issue reported on code.google.com by [email protected] on 10 Feb 2015 at 3:36

cld2 testsuite failures

What steps will reproduce the problem?
1. checkout revision 194
2. use the cmake file (probably doesn't change anything)
3. use ubuntu 14.10 x64

build it and run tests
make[1]: Entering directory '/tmp/buildd/cld2-0.0.0~svn194'
cd obj-* && echo "this is some english text" | ./compact_lang_det_test_chrome_2
ExtLanguage ENGLISH(96% 1851p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 81us (0 MB/sec), (null)
cd obj-* && echo "this is some english text" | ./compact_lang_det_test_chrome_16
ExtLanguage ENGLISH(96% 1851p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 79us (0 MB/sec), (null)
cd obj-* && ./cld2_unittest_chrome_2 > /dev/null
*** Bad UTF-8 after 40 bytes<br>
Checking that non-dynamic implementations of dynamic data methods are no-ops (ignore the warnings).
WARNING: Dynamic mode not active, loadDataFromFile has no effect!
WARNING: Dynamic mode not active, loadDataFromRawAddress has no effect!
WARNING: Dynamic mode not active, unloadData has no effect!
Done checking non-dynamic implementations of dynamic data methods, care about warnings again.
PASS
cd obj-* && ./cld2_unittest_avoid_chrome_2 > /dev/null
*** Bad UTF-8 after 40 bytes<br>
Checking that non-dynamic implementations of dynamic data methods are no-ops (ignore the warnings).
WARNING: Dynamic mode not active, loadDataFromFile has no effect!
WARNING: Dynamic mode not active, loadDataFromRawAddress has no effect!
WARNING: Dynamic mode not active, unloadData has no effect!
Done checking non-dynamic implementations of dynamic data methods, care about warnings again.
PASS
cd obj-* && echo "this is some english text" | ./compact_lang_det_test_full
ExtLanguage ENGLISH(96% 1772p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 153us (0 MB/sec), (null)
cd obj-* && ./cld2_unittest_full > /dev/null
PASS
cd obj-* && ./cld2_unittest_full_avoid > /dev/null
PASS
cd obj-* && ./cld2_dynamic_data_tool --dump cld2_data.bin
cd obj-* && ./cld2_dynamic_data_tool --verify cld2_data.bin
cd obj-* && echo "this is some english text" | ./compact_lang_det_dynamic_test_chrome --data-file cld2_data.bin
Loading data from: cld2_data.bin
Data loaded, test commencing
ExtLanguage ENGLISH(96% 1851p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 69us (0 MB/sec), --data-file
cd obj-* && ./cld2_dynamic_unittest --data-file cld2_data.bin > /dev/null
*** Bad UTF-8 after 40 bytes<br>
*** Bad UTF-8 after 40 bytes<br>
PASS
make[1]: Leaving directory '/tmp/buildd/cld2-0.0.0~svn194'


don't know, is everything ok?

Original issue reported on code.google.com by [email protected] on 12 Feb 2015 at 5:24

Enable dynamic data for 20141015 release

The 20141015 tables don't compile with the dynamic data tool because they are 
missing the hand-crafted "agnostic" constants that were put in for the old 
release. Attached is a patch that appears to make this work for the dynamic 
data tool.

Original issue reported on code.google.com by [email protected] on 31 Oct 2014 at 7:12


No languages output despite isReliable=True

I'm using Mike McCandless' Python binding to cld2. I originally reported this 
issue to him, and he suggested I report it here (see 
https://code.google.com/p/chromium-compact-language-detector/issues/detail?id=15).

The issue is that for a particular input string, cld2 reports that the 
prediction is reliable, but the set of languages detected is empty.

What steps will reproduce the problem?
1. import cld2
2. cld2.detect('interaktive infografik \xc3\xbcber videospielkonsolen')

What is the expected output? What do you see instead?
The output is 

(True, 49, ())


What version of the product are you using? On what operating system?
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2

cld2 was built using SVN rev 63, 
cld python module was built using hg changeset b1cad3f04ef4

Original issue reported on code.google.com by [email protected] on 6 Aug 2013 at 3:08

Undefined language on a page that looks normal

Apparently, CLD2 has some difficulties(*) with 
http://drugoi.livejournal.com/3971967.html 

We are seeing UND (undefined) on chrome://translate-internals

*: or maybe we are mis-using it...

Original issue reported on code.google.com by [email protected] on 5 Mar 2014 at 6:59

Windows build fails: undeclared identifier 'close'

There appears to be a weird mix of both open() and fopen() (with corresponding 
close() and fclose()) in cld2_dynamic_data_loader.cc, and possibly other places 
in the code. We should consistently use one or the other. To use close() we'd 
also technically need to depend on unistd.h, I think, which we currently don't. 
This is causing some problems for Chromium, though why it has just cropped up 
now I could not say:

https://code.google.com/p/chromium/issues/detail?id=403222

The fix here should be trivial, and I'll take care of it now.
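
As a rough illustration of staying within one API family, the sketch below reads a file using only `<cstdio>`, so `fopen()` is always paired with `fclose()` and nothing from unistd.h is needed. `ReadFileSketch` is a hypothetical helper, not code from cld2_dynamic_data_loader.cc.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical helper: reads a whole file into *contents using stdio only.
// Returns false if the file cannot be opened or a read error occurs.
bool ReadFileSketch(const char* path, std::vector<unsigned char>* contents) {
  std::FILE* file = std::fopen(path, "rb");
  if (file == nullptr) {
    return false;
  }
  contents->clear();
  unsigned char chunk[4096];
  size_t got;
  while ((got = std::fread(chunk, 1, sizeof(chunk), file)) > 0) {
    contents->insert(contents->end(), chunk, chunk + got);
  }
  const bool ok = std::ferror(file) == 0;
  std::fclose(file);  // fopen() pairs with fclose(), never with close()
  return ok;
}
```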

Original issue reported on code.google.com by [email protected] on 13 Aug 2014 at 9:41

SIGBUS on ARM32 in utf8statetable.cc:517

I'm trying to get CLD2 working on ARM32 inside of Chromium, cross-compiling 
from a linux x64 host to arm32. The library loads properly, but the following 
crash occurs when calling DetectLanguageSummary:

Program received signal SIGBUS, Bus error.
#0  CLD2::UTF8GenericScan (st=0x61a82104, str=<optimized out>, 
bytes_consumed=0x5f00f88c)
    at ../../third_party/cld_2/src/internal/utf8statetable.cc:518

I'll attach the full trace as a file. Well, minus the Chromium bits. Anyhow, 
the problem appears to be with this snippet of code in utf8statetable.cc:

  // Do fast for groups of 8 identity bytes.
  // This covers a lot of 7-bit ASCII ~8x faster than the 1-byte loop,
  // including slowing slightly on cr/lf/ht
  //----------------------------
  const uint8* Tbl2 = &st->fast_state[0];
  uint32 losub = st->losub;
  uint32 hiadd = st->hiadd;
  while (src < srclimit8) {
    uint32 s0123 = (reinterpret_cast<const uint32 *>(src))[0];
    uint32 s4567 = (reinterpret_cast<const uint32 *>(src))[1];
    src += 8;


Inspecting the pointers in the debugger during the crash, and looking at the 
"src" variable, seems to reveal the problem:
(gdb) p src
$32 = (
    const CLD2::uint8 *) 0x58de4bee "\n\n\n百度一下\n地图贴吧视频图片hao123\n新闻应用音乐文库更多\n小说游戏下载\n把百度放到桌面上,
搜索最方便\n触屏版极速版\nBaidu 京ICP证030173号"

Specifically, src is located at 0x58de4bee. Since this isn't a 4-byte (32-bit) 
aligned address, the SIGBUS presumably comes from trying to read it as a 
uint32*. Many thanks to [email protected] and [email protected] for the 
help in diagnosing this, I was a bit lost in the weeds looking at my dynamic 
data changes, which turn out to be completely unrelated (this happens with and 
without dynamic data mode).

The suggested workaround for this case is to %4 the address and do a one-off 
scan of the first 0-3 bytes (as necessary), and then descend into the fast 
loop; the concern is that there may be other places in CLD2 that have similar 
behavior and might be time bombs. It might be a good idea to add some memory 
churning code to the unit tests, and then start running the unit tests 
themselves on ARM to further diagnose other problems like this that might arise.
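
The suggested workaround can be sketched roughly as follows. This is not the state-table scan from utf8statetable.cc: `CountLeadingAscii` is a hypothetical function that only counts leading 7-bit ASCII bytes, chosen to show the byte-wise prologue followed by the 8-byte fast loop. The use of `memcpy` for the 32-bit loads is an extra assumption of this sketch; it keeps the reads well-defined even if a caller skips the prologue.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns how many leading bytes of [src, src + len) are 7-bit ASCII.
size_t CountLeadingAscii(const uint8_t* src, size_t len) {
  const uint8_t* p = src;
  const uint8_t* limit = src + len;

  // Prologue: scan the first 0-3 bytes one at a time until p is 4-byte aligned.
  while (p < limit && (reinterpret_cast<uintptr_t>(p) & 3) != 0) {
    if (*p & 0x80) return static_cast<size_t>(p - src);
    ++p;
  }

  // Fast loop: 8 bytes per iteration, starting from an aligned address.
  while (limit - p >= 8) {
    uint32_t s0123;
    uint32_t s4567;
    std::memcpy(&s0123, p, sizeof(s0123));
    std::memcpy(&s4567, p + 4, sizeof(s4567));
    if ((s0123 | s4567) & 0x80808080u) break;  // some byte has its high bit set
    p += 8;
  }

  // Epilogue: finish the remaining bytes one at a time.
  while (p < limit && !(*p & 0x80)) ++p;
  return static_cast<size_t>(p - src);
}
```

Calling it with a deliberately misaligned pointer (for example `buffer + 1`, `buffer + 2`, `buffer + 3`) is one simple way to exercise the prologue in a unit test.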

Original issue reported on code.google.com by [email protected] on 20 Mar 2014 at 3:11


Enable dynamic mode

As discussed offline, this is a patch to enable CLD2 to run in "dynamic" mode. 
In dynamic mode the kScoringtables struct is populated from a file at runtime 
instead of being compiled into the program as a read-only section in the binary.

This patch adds a new cld2_dynamic_data_tool and accompanying build 
instructions, and patches the unit tests to exercise all dynamic functionality. 
Data can be loaded, unloaded, and reloaded - theoretically allowing continuous 
operations of the program when updated tables are available.

It still has some hardcoding, but we can fix the underlying issues in the 
source code easily in another pass as you've suggested.

Original issue reported on code.google.com by [email protected] on 25 Feb 2014 at 6:30


cld2_dynamic_data.cc and cld2_dynamic_data_loader.cc problems on Win32

Chromium build output from one of the buildbots:

FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma/gomacc 
"E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe" /nologo 
/showIncludes /FC 
@obj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data.obj.rsp /c 
..\..\third_party\cld_2\src\internal\cld2_dynamic_data.cc 
/Foobj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data.obj 
/Fdobj\third_party\cld_2\cld2_dynamic.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data.cc(33) : error C2039: 'max' : is not a member of 'std'
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data.cc(33) : error C3861: 'max': identifier not found
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data.cc(85) : warning C4018: '<' : signed/unsigned mismatch
FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma/gomacc 
"E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe" /nologo 
/showIncludes /FC 
@obj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data_loader.obj.rsp /c 
..\..\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc 
/Foobj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data_loader.obj 
/Fdobj\third_party\cld_2\cld2_dynamic.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc(99) : error C2220: warning treated as error - no 'object' file generated
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc(99) : warning C4018: '<' : signed/unsigned mismatch
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc(235) : warning C4018: '<' : signed/unsigned mismatch

This should be fixed.
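
A sketch of the kind of change that usually clears these diagnostics is below. It assumes, without confirmation from the log, that the C2039 error comes from a missing `<algorithm>` include and that the C4018 warnings come from comparing an `int` index against an unsigned size. `LargestTableSketch` is a hypothetical helper, not the code at the reported lines.

```cpp
#include <algorithm>  // declares std::max; other headers may not pull it in
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the largest entry, or 0 for an empty list.
int LargestTableSketch(const std::vector<int>& table_sizes) {
  int largest = 0;
  // A size_t index matches the unsigned type of size(), which avoids C4018.
  for (size_t i = 0; i < table_sizes.size(); ++i) {
    // The parentheses keep a max() macro (for example from windows.h)
    // from expanding here.
    largest = (std::max)(largest, table_sizes[i]);
  }
  return largest;
}
```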

Original issue reported on code.google.com by [email protected] on 1 Oct 2014 at 2:36

Missing Apache license header text in several source files

The following files are affected:
cld2_generated_quadchrome0122_16.cc
cld2_generated_deltaoctachrome0122.cc
cld2_generated_deltaocta0122.cc
cld2_generated_quadchrome0122_19.cc
cld2_generated_distinctoctachrome0122.cc
cld2_generated_distinctocta0122.cc
cld2_generated_quadchrome0122_2.cc

Unfortunately this prevents Chrome from rolling to the latest CLD2. Patch 
attached.

Original issue reported on code.google.com by [email protected] on 14 Mar 2014 at 5:20


Translation bar shows up for the English website and detects as "Malay"


What steps will reproduce the problem?
1. Launch chrome with the flag --force-fieldtrials=CLD1VsCLD2/CLD2/
2. Open website <https://play.google.com/store>
3. Go to movie and Romancing Bollywood, then click on see more movies.
4. With the location set to India, the page is detected as "Malay" although it 
is in English (refer to the attached screenshot).

What is the expected output? 
No translation bar as the language of website is English.

What do you see instead?
A translation bar asking for translation from Malay to English.

What version of the product are you using? On what operating system?
Version: 32.0.1657.2 (Official Build 226144) 
OS: Linux Ubuntu 12.04

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 7 Nov 2013 at 9:39


Can't link "dynamic" and "full"

What steps will reproduce the problem?

This "g++" command-line is a mix between "full" and "dynamic":

g++ -O2 -m64 cld2_dynamic_data_tool.cc cld2_dynamic_data.cc 
cld2_dynamic_data_extractor.cc cld2_dynamic_data_loader.cc cldutil.cc 
cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc 
compact_lang_det_impl.cc debug.cc fixunicodevalue.cc generated_entities.cc 
generated_language.cc generated_ulscript.cc getonescriptspan.cc lang_script.cc 
offsetmap.cc scoreonescriptspan.cc tote.cc utf8statetable.cc 
cld_generated_cjk_uni_prop_80.cc cld2_generated_cjk_compatible.cc 
cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc 
cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc 
cld2_generated_distinctocta0122.cc cld_generated_score_quad_octa_0122.cc -o 
cld2_dynamic_data_tool

What is the expected output? What do you see instead?

cld2_dynamic_data_tool.cc:(.text.startup+0x293): Undefined `CLD2::kQuadChromeIndSize'
cld2_dynamic_data_tool.cc:(.text.startup+0x29d): Undefined `CLD2::kQuadChrome2IndSize'



Original issue reported on code.google.com by [email protected] on 9 Apr 2014 at 9:59

Eliminate redundancy and/or simplify default case for compiling unittest_data.h

internal/unittest_data.h seems to use a mixture of escape sequences and raw 
non-ASCII text. For maximum portability and safety, it would be best for the 
source code to use all ASCII characters and escape the non-ASCII characters. 
This should help compiler compatibility, though there are no reports of 
breakage since this Chromium.org issue back in 2009:

https://code.google.com/p/chromium/issues/detail?id=20033

The change should be simple enough, and a script can be written to perform the 
transformation.
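
The core of such a transformation might look like the sketch below. `EscapeNonAscii` is a hypothetical function, and the choice of three-digit octal escapes is an assumption of this example: unlike `\x` escapes, which consume every following hex digit, an octal escape stops after three digits, so a sequence such as `\303\274ber` cannot be misread.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: copies ASCII bytes unchanged and rewrites every
// non-ASCII byte as a three-digit octal escape. Existing ASCII escape
// sequences in the input are left as they are.
std::string EscapeNonAscii(const std::string& text) {
  std::string out;
  for (size_t i = 0; i < text.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      char escape[5];  // backslash + 3 octal digits + terminating NUL
      std::snprintf(escape, sizeof(escape), "\\%03o", static_cast<unsigned>(c));
      out += escape;
    }
  }
  return out;
}
```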

Original issue reported on code.google.com by [email protected] on 26 Aug 2014 at 3:27

Support mmap-ing dynamic data on win32

As described in issue 19, the current implementation of dynamic data won't work 
in windows because it relies on:
 * from sys/mman.h: mmap(), munmap()
 * from unistd.h: close()

These header files don't exist in vanilla win32 build environments, so 
compatibility is broken. The fix for close() is being implemented in issue 19, 
but the fix for mmap() is less straightforward.

Original issue reported on code.google.com by [email protected] on 13 Aug 2014 at 11:40

Check in tools for generating generated_* files

There are a lot of data files that get generated, such as 
cld_generated_cjk_uni_prop_80.cc and its ilk. There have been several problems 
in the past with the generated files that have necessitated post-generation 
fixes, e.g.:

https://code.google.com/p/cld2/source/detail?r=155
https://code.google.com/p/cld2/source/detail?r=156
https://code.google.com/p/cld2/source/detail?r=189
https://code.google.com/p/cld2/source/detail?r=192
https://code.google.com/p/cld2/source/detail?r=193

...

And now we have issue 32, which is more of the same. We don't have the 
templates (or whatever is used to generate these source files) checked in; we 
should. I get that the actual data is huge and isn't something we'd store in 
Git, but I'd really like to see us put the templates/generators into the code 
base so that we can maintain them alongside the code that they produce.

High priority because I feel that at this point there is likely drift between 
the templates and the code they produce; we should probably get the templates 
checked in and iterate on them until they produce exactly the same files that 
we have today, then proceed forward with maintenance.

WDYT?

Original issue reported on code.google.com by [email protected] on 1 May 2015 at 8:34

new code location?

Hi, since google code is closing, where do you plan to move the packaging?

thanks!

Original issue reported on code.google.com by [email protected] on 6 May 2015 at 2:21

CLD2DynamicDataLoader calls delete instead of delete[] on array types

Upon running some browser tests in Chrome, the following error was encountered 
when attempting to call CLD2::loadDataFromRawAddress():

memory allocation/deallocation mismatch at 0x155bb621cb20: allocated with new [] being deallocated with delete
Received signal 11 SEGV_MAPERR 000000000039
...
#6 0x000002b7b791 MallocBlock::CheckLocked()
#7 0x000002b7b422 MallocBlock::CheckAndClear()
#8 0x000002b7bb4a MallocBlock::Deallocate()
#9 0x000002b79109 DebugDeallocate()
#10 0x000009e02885 operator delete()
#11 0x000006ecd635 CLD2DynamicDataLoader::loadDataInternal()
#12 0x000006ecd325 CLD2DynamicDataLoader::loadDataRaw()
#13 0x000006eba963 CLD2::loadDataFromRawAddress()

I'm not sure why this wasn't caught earlier in testing. It may be a consequence 
of toolchain changes in Chromium, but the error seems valid and should be 
fixed. This was previously working without issue on both Linux and Android 
platform builds for x64 and ARM respectively.

I will review the other uses of delete to see if there are more occurrences. 
This should be a trivial fix, but blocks adoption of CLD2 dynamic mode in 
Chromium.
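
For reference, the general shape of the fix is to pair every `new[]` with `delete[]`. The sketch below uses a hypothetical `SumOffsetsSketch` function with an invented scratch buffer; it does not reproduce the code in `CLD2DynamicDataLoader::loadDataInternal()`.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies the offsets into a scratch array, sums them,
// and releases the array with the matching array form of delete.
int SumOffsetsSketch(const int* offsets, size_t count) {
  int* scratch = new int[count];  // allocated with new[]
  std::memcpy(scratch, offsets, count * sizeof(int));
  int total = 0;
  for (size_t i = 0; i < count; ++i) {
    total += scratch[i];
  }
  delete[] scratch;  // must be delete[], not delete
  return total;
}
```

Where C++11 is available, holding the array in a `std::unique_ptr<int[]>` would select the array form automatically.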

Original issue reported on code.google.com by [email protected] on 15 May 2014 at 4:32

Dynamic data loading should not use iostream

Dynamic data loading currently uses iostream for logging.

That would be fine, except that nowhere else in the library is iostream used, 
meaning this is bringing in many classes for little gain, and only when dynamic 
data loading is turned on.
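
A small stdio-based helper would be enough to replace the iostream logging. The sketch below is one possible shape; `LogLoaderMessage` is a hypothetical name and not an existing CLD2 function.

```cpp
#include <cstdarg>
#include <cstdio>

// Hypothetical logging helper: printf-style output to stderr without
// pulling in <iostream>. Returns the number of characters written, or a
// negative value on error, as vfprintf does.
int LogLoaderMessage(const char* format, ...) {
  va_list args;
  va_start(args, format);
  const int written = std::vfprintf(stderr, format, args);
  va_end(args);
  return written;
}
```

A call such as `LogLoaderMessage("Loaded %d tables\n", 3)` then takes the place of the corresponding stream insertion.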

Original issue reported on code.google.com by [email protected] on 15 Jul 2014 at 9:39
