This project is my bachelor's thesis on the topic "Generation of images of urban landscapes in high resolution from a textual description using deep learning methods."
Given a text that takes place in a city, the service generates 1024x1024 images of the corresponding urban location.
The service works as follows (illustrative code sketches of the translation, segmentation, and CLIP-ranking steps follow the list):
- The service accepts a request containing text in Russian. The text is translated into English.
- Next, the text is sent to a third-party service that predicts its location using NLP methods. The location is assumed to be urban.
- Images matching the textual description of the predicted location are downloaded from free photo stocks (Depositphotos, Unsplash).
- The downloaded images are segmented into classes corresponding to objects of the urban environment (road, sidewalk, building, etc.). SegFormer is used for this.
- For each obtained segmentation map, several images are generated using OASIS.
- The generated images that best match the textual description are selected by CLIP score.
- Finally, 2x super-resolution is performed on each image using Real-ESRGAN.
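A minimal sketch of the RU-to-EN translation step using the transformers library. The checkpoint name Helsinki-NLP/opus-mt-ru-en is an assumption; the service may use a different translator.

```python
# Sketch of the RU -> EN translation step. The checkpoint is an
# assumption, not necessarily the one used in the thesis.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ru-en")

def translate_to_english(text_ru: str) -> str:
    # The pipeline returns a list of dicts with a "translation_text" key
    return translator(text_ru)[0]["translation_text"]

print(translate_to_english("Прогулка по центру Москвы"))
```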
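A sketch of the segmentation step, assuming a Cityscapes-finetuned SegFormer checkpoint from Hugging Face (the actual thesis weights are downloaded separately, see below); its classes include road, sidewalk, and building.

```python
# Sketch of SegFormer inference. The checkpoint is an assumption.
import torch
from PIL import Image
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

name = "nvidia/segformer-b0-finetuned-cityscapes-1024-1024"
processor = SegformerImageProcessor.from_pretrained(name)
model = SegformerForSemanticSegmentation.from_pretrained(name)

def segment(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, num_classes, H/4, W/4)
    # Upsample logits to the input resolution, then take argmax per pixel
    logits = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)[0]  # (H, W) map of Cityscapes class ids
```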
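And a sketch of CLIP-based ranking: score each generated image against the text prompt and keep the top k. The checkpoint and the helper function are illustrative, not the project's actual code.

```python
# Sketch of selecting the best images by CLIP score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_by_clip_score(text: str, images: list[Image.Image], k: int = 10):
    inputs = processor(text=[text], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image[i, 0] is the image-text similarity for image i
    scores = outputs.logits_per_image.squeeze(1)
    best = scores.topk(min(k, len(images))).indices.tolist()
    return [images[i] for i in best]
```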
Download the weights for the SegFormer model from here and unpack them into segmentation/.
Download the weights for the OASIS model from here and unpack them into generation/OASIS/checkpoints.
docker build -t t2i-urban .
docker run -d -p 8080:8080 --name=[container-name] t2i-urban
docker start [container-name]
docker attach [container-name]
Alternatively, to install the package locally:
pip install -v -e .
Running the Docker container starts a server (on localhost:8080) that accepts requests in the following format:
- method: get-images
- Content-Type: application/json
- request format: {"text": string}
- response format: {"result": list<base64-string>} (a list of base64-encoded images)
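For reference, an example client in Python. It assumes the endpoint is exposed as POST http://localhost:8080/get-images and that the returned images decode to PNG; adjust to the actual route and format.

```python
# Example client. Assumptions: the server exposes POST /get-images on
# localhost:8080 and the base64 strings decode to PNG images.
import base64
import requests

resp = requests.post(
    "http://localhost:8080/get-images",
    json={"text": "Прогулка по вечернему центру Москвы"},  # Russian input text
)
resp.raise_for_status()

# Response format: {"result": [<base64 string>, ...]}
for i, b64 in enumerate(resp.json()["result"]):
    with open(f"result_{i}.png", "wb") as f:
        f.write(base64.b64decode(b64))
```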
To generate images manually, run main.py. Usage:
python main.py --text TEXT [--save-dir DIR] [--samples SAMPLES] [--from-text]
--text TEXT - input text
--save-dir DIR - results will be saved here (default is './results')
--samples SAMPLES - number of images to be generated (default is 10)
--from-text - with this option enabled, images are generated directly from the input text
              instead of from the predicted location
              (thus the text must contain a description of the urban location)
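For example (the input text here is a hypothetical Russian description):
python main.py --text "Прогулка по центру Москвы" --samples 5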
The third-party service that predicts the text's location may be unavailable. In that case, use the --from-text
option to generate images directly from the input text; the text must then contain a description of the urban location.