
trendyolscraper's Introduction

Getting Started

This is a Python 3 script that uses Selenium to scrape images and metadata from Trendyol.com. An attribute analysis script is also included to generate excel and log files that describe the downloaded data. The script can also generate .csv files according to a labelmap so that the downloaded dataset can easily be used for machine learning.

Installation

In order to install the packages required to run the scripts, run the following command:

pip install -r requirements.txt

Usage

This repository has three different scripts. The main script that does the scraping is TrendyolScraper.py.

These are the arguments for TrendyolScraper.py:

  • --url    The url of the Trendyol search that will be scraped.

  • --urlsPath    The path to a .txt file that contains the list of urls, one url per line.

  • Note: Either --url or --urlsPath is REQUIRED

  • --path    The path of the directory that all the image and .meta files will be downloaded into.

  • --max    Maximum number of images that will be downloaded; no limit by default. OPTIONAL

  • --prefix    A prefix that will be put in front of all downloaded file names. Use this if you are going to make multiple downloads into the same directory; otherwise files from the first download will be overwritten. No prefix by default. OPTIONAL

  • -n    Skip downloading the scraped images; the .meta files will still be generated. OPTIONAL

  • -l    Create a .txt file that lists the urls of the scraped images. You can later use this .txt file with download_images.py to download the images to a remote location without having to rescrape. OPTIONAL

Example usage

 python TrendyolScraper.py --url "https://www.trendyol.com/erkek-gomlek-x-g2-c75" --path ./Dataset --max 100 --prefix m

Note: Do not pass a --max argument if you want to download as many images as possible
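The command-line interface described above could be wired together with argparse roughly as follows. This is a sketch of the documented flags only, not the actual internals of TrendyolScraper.py:

```python
import argparse

def build_parser():
    # Sketch of the documented CLI; the real script's implementation may differ.
    parser = argparse.ArgumentParser(description="Scrape a Trendyol search page")
    parser.add_argument("--url", help="url of the Trendyol search to scrape")
    parser.add_argument("--urlsPath", help="path to a .txt file with one url per line")
    parser.add_argument("--path", default=".", help="directory for images and .meta files")
    parser.add_argument("--max", type=int, default=None, help="maximum number of images")
    parser.add_argument("--prefix", default="", help="prefix for downloaded file names")
    parser.add_argument("-n", action="store_true", help="skip images, still write .meta files")
    parser.add_argument("-l", action="store_true", help="write a .txt listing scraped image urls")
    return parser

# Mirrors the example invocation from this README.
args = build_parser().parse_args(
    ["--url", "https://www.trendyol.com/erkek-gomlek-x-g2-c75",
     "--path", "./Dataset", "--max", "100", "--prefix", "m"]
)
print(args.max, args.prefix)  # prints: 100 m
```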

 

The second script is download_images.py, a tool to efficiently download images in bulk from the .txt file generated by the first script (TrendyolScraper.py).

It has three arguments:

  • --file    The path of the .txt file that has the formatted list of urls to be downloaded. REQUIRED

  • --dir    The directory where the images will be downloaded; default is the directory where the script is run. OPTIONAL

  • --prefix    A prefix that will be put in front of all downloaded file names. Use this if you are going to make multiple downloads into the same directory; otherwise files from the first download will be overwritten. No prefix by default. OPTIONAL

Example usage

 python download_images.py --file "m-imageUrls.txt" --dir "./images" --prefix m
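A bulk downloader like this can be sketched with a thread pool, since downloading is I/O-bound. This is an illustration only; the one-url-per-line file format, the naming scheme, and the `read_url_list`/`download_all` helpers are assumptions, not the actual internals of download_images.py:

```python
import concurrent.futures
import os
import urllib.request

def read_url_list(path):
    # One url per line; blank lines are skipped (file format is an assumption).
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def download_all(list_file, out_dir=".", prefix="", workers=8):
    urls = read_url_list(list_file)
    os.makedirs(out_dir, exist_ok=True)

    def fetch(url):
        # Derive a file name from the url path and prepend the prefix.
        name = prefix + os.path.basename(url.split("?", 1)[0])
        dest = os.path.join(out_dir, name)
        urllib.request.urlretrieve(url, dest)
        return dest

    # A thread pool is enough here: the work is network-bound, not CPU-bound.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```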

 

The third script is attribute_analysis.py, which provides a few tools for interpreting the data that you downloaded.

This script has three modes:

  • -xlsx    The script will generate an excel file that contains all the attribute categories and attributes found within the metadata of the images in the specified directory, along with statistics of how many images were labeled with each attribute.

  • -d    The script will create a .txt file with detailed information about which attributes were labeled for every image file in the specified directory.

  • -csv    The script will generate a .csv file that describes the entire dataset found in the specified directory according to a labelmap file. This mode has to be used along with the --labelmap argument; see the description of --labelmap for a detailed explanation of how to use this mode.

The script also has two arguments:

  • --path    The path of the directory that will be scanned for .meta files. REQUIRED

  • --labelmap    The path to the .json file that contains the labelmap for the .csv file. Required for -csv mode.

An example labelmap:

{
    "Kol Tipi": { //The exact name of the category as found in the .meta files
        "name": "Sleeve Type", //The name of the category that will be written into the .csv file, you can change this as you want
        "attributes": [ //List of attributes that belong to the category
            {
                "name": "Short Sleeve", //The name of the attribute, you can change this as you want. This is not written into .csv file and is here for postprocessing purposes
                "subattributes": [ //List of the exact names of the attributes as found in the .meta files, if you put multiple names they will be merged into this single attribute
                    "Kısa Kol",
                    "Kısa"
                ]
            }
        ]
    },
    "Renk" : {...}
}
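One way such a labelmap can be consumed is to invert it into a lookup from the raw .meta attribute strings to merged attribute indices. This is a sketch: `build_lookup` is a hypothetical helper, and the "Long Sleeve" entry below is invented to show a second index.

```python
import json

def build_lookup(labelmap):
    # For each raw category name, map every "subattributes" string (the exact
    # name found in the .meta files) to the index of its merged attribute.
    lookup = {}
    for raw_category, category in labelmap.items():
        index = {}
        for i, attribute in enumerate(category["attributes"]):
            for raw_name in attribute["subattributes"]:
                index[raw_name] = i
        lookup[raw_category] = (category["name"], index)
    return lookup

labelmap = json.loads("""
{
  "Kol Tipi": {
    "name": "Sleeve Type",
    "attributes": [
      {"name": "Short Sleeve", "subattributes": ["Kısa Kol", "Kısa"]},
      {"name": "Long Sleeve", "subattributes": ["Uzun Kol"]}
    ]
  }
}
""")
name, index = build_lookup(labelmap)["Kol Tipi"]
print(name, index["Kısa"])  # prints: Sleeve Type 0
```

Note that both "Kısa Kol" and "Kısa" resolve to index 0, which is exactly the merging behaviour the "subattributes" list describes.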

Important note: the "exact names" must match the attribute names exactly as they appear inside the .meta files. You can generate an excel file by running this script in -xlsx mode to see all of the exact category and attribute names easily:

 python attribute_analysis.py -xlsx --path ./Dataset
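The per-attribute statistics that -xlsx reports could be gathered roughly like this. This is a sketch only; it assumes each .meta file is a JSON object mapping category names to attribute strings, which may not match the real format:

```python
import collections
import json
import pathlib

def tally_attributes(dataset_dir):
    # Count how many images carry each (category, attribute) pair.
    # ASSUMPTION: one JSON object per .meta file, category name -> attribute.
    counts = collections.Counter()
    for meta_path in pathlib.Path(dataset_dir).glob("*.meta"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        for category, attribute in meta.items():
            counts[(category, attribute)] += 1
    return counts
```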

See example_labelmap.json for a complete example of a labelmap suitable for a dataset downloaded from the links https://www.trendyol.com/kadin-t-shirt-x-g1-c73 and https://www.trendyol.com/erkek-t-shirt-x-g2-c73

Example usage

python attribute_analysis.py -csv --path ./Dataset --labelmap example_labelmap.json

The csv file created with this script may look like this:

File Name,Gender,Color,Sleeve type,Collar Type,Pattern,Material Type,Fit,Style
m-1003_2.jpg,0,3,0,0,0,1,3,0

Numbers like 0 and 3 in the csv correspond to the indices of the attributes in the order they were given in your labelmap. For example, the first 3 in the sequence points to the fourth attribute given in the "attributes" list of the color category, which was green in my case.
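Going the other way, an index from the csv can be resolved back to a human-readable name by looking it up in the category's "attributes" list. The four-attribute color category below is made up for illustration; in your labelmap the order and names will differ:

```python
# Hypothetical color category where index 3 happens to be Green.
labelmap = {
    "Renk": {
        "name": "Color",
        "attributes": [
            {"name": "Red", "subattributes": ["Kırmızı"]},
            {"name": "Blue", "subattributes": ["Mavi"]},
            {"name": "White", "subattributes": ["Beyaz"]},
            {"name": "Green", "subattributes": ["Yeşil"]},
        ],
    }
}

def decode(raw_category, idx):
    # Index i in the csv selects the (i+1)-th entry of that category's
    # "attributes" list, in labelmap order.
    return labelmap[raw_category]["attributes"][idx]["name"]

print(decode("Renk", 3))  # prints: Green
```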

trendyolscraper's People

Contributors

isaturk66

