# myd1 / dmoz2db

This project was forked from joknopp/dmoz2db.
A database importer for the open directory project (aka dmoz) data
dmoz2db is a tool that parses the RDF-like dumps from http://rdf.dmoz.org/rdf/ and loads their contents into a database. dmoz2db is tested with MySQL but should work with other databases as well. IT COMES WITH ABSOLUTELY NO WARRANTY OF ANY KIND.

## Instructions

To use dmoz2db you need to install SQLAlchemy 0.6.5 or higher (http://www.sqlalchemy.org).

Your database must have UTF-8 support enabled. For MySQL, a description of how to do that is available here: http://cameronyule.com/2008/07/configuring-mysql-to-use-utf-8

The database where the dmoz data will be stored must be created manually:

```
mysql> CREATE DATABASE DATABASENAME;
mysql> GRANT ALL ON DATABASENAME.* TO 'USERNAME'@'localhost';
```

After that, edit db.sample.conf according to your setup and save it as db.conf. The database design is documented in the HTML pages in the doc folder.

## Running

If the RDF files are present in your current directory, you can simply run

```
~/dmoz-dir/src $ python dmoz2db.py
```

but you may want to run

```
~/dmoz-dir/src $ python dmoz2db.py --help
```

first and look at the available options; most of them are self-explanatory.

If you are not interested in the complete dmoz dataset, you can specify a topic filter to ignore everything that is not under the given category, which speeds up the import. Take care with trailing slashes: 'Top/Computers' includes the category itself, while 'Top/Computers/' filters for everything under that category. Every category whose father was filtered out gets the default father id 1.

Debug output should be turned on only in combination with the log file option, because every SQL statement is printed.

The import will take time, so go to lunch or find something else to do :) And don't halloo till you're out of the wood: a first parse inserts the basic topic structure into the database, then the father ids are generated, and after that a second parse adds all the additional information such as related categories and other languages.
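The trailing-slash semantics of the topic filter can be illustrated with a small prefix check. This is only a sketch of the rule described above, not dmoz2db's actual implementation, and the function name is hypothetical:

```python
def topic_matches(topic, topic_filter):
    """Return True if `topic` passes the filter.

    'Top/Computers'  -> the category itself plus everything below it
    'Top/Computers/' -> only topics strictly below the category
    """
    if topic_filter.endswith('/'):
        # Trailing slash: filter for everything *under* the category.
        return topic.startswith(topic_filter)
    # No trailing slash: include the category itself as well.
    return topic == topic_filter or topic.startswith(topic_filter + '/')

# 'Top/Computers' passes the filter 'Top/Computers' but not 'Top/Computers/':
topic_matches('Top/Computers', 'Top/Computers')    # True
topic_matches('Top/Computers', 'Top/Computers/')   # False
```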
Last but not least, the content.rdf file is parsed to add the externalpage information to the database. On my laptop the first parse took about 20 minutes, generating the father ids took 25 minutes, the second parse took 2:08 h, and the content.rdf file took 8 h, which comes to roughly 11 h in total. One dot in the output means 10,000 processed topics; a newline is printed after 200,000 topics. In the structure.rdf file, entries dealing with the last editor are ignored. For content.rdf, the tags `<mediadate>`, `<type>`, `<uksite>`, `<age>` and `<priority>` are ignored because they are present in only a fraction of the data.