Crawling for images

Adrian Macal
5 min read · May 19, 2023


Many data science projects require a vast amount of data to properly train their models. As a result, data has evolved into a valuable commodity, prompting certain companies to monetize its sale. However, an alternative approach is to gather data independently, a process that necessitates particular skills, reliable sources, and most critically - time.

In this article, I’ll share what I’ve recently learned about data crawling techniques for acquiring a large volume of high-quality images. I will primarily concentrate on the technical aspects of discovering and downloading these resources.

50,000-foot perspective

To collect data effectively, I built the image crawling process around the following components:

The coordinator is the entity that maintains a record of what has already been discovered, visited, or downloaded. Its responsibility lies in assigning tasks to more specialized workers and collecting their outputs once the tasks are completed.

The storage is the mechanism that manages the artifacts already acquired, such as images, metadata, or links. It offers an efficient method for storing or retrieving these artifacts.

The discovery worker takes on the role of opening a link and identifying valuable elements on the page. This may involve extracting images, discovering additional links to browse, or gathering useful metadata.

The download worker, as the name suggests, is solely tasked with downloading the identified images. This component doesn’t involve decision-making but executes the final step of the process.

The storage component

The storage part may seem to merely keep downloaded pictures, their metadata, and discovered links, but its design ultimately decides the success or failure of the whole collection process.

I built my own solution for this: a very specific folder structure on my SSD, designed to handle large amounts of data while still letting me inspect any item manually. I ended up with the following layout:

root
- queue
  - 1683388
    - 168338802
    - 168338803
    - 168338802
  - 1683389
    - 168338908
    - 168338910
- visited
  - 1413
    - 1413745000559-46fdd2d81cd7
    - 1413745066752-18f122473e3b.v3
    - 1413882353314-73389f63b6fd.v5+t
- downloaded
  - 600
    - 1413
      - 1413708617479-50918bc877eb.png
      - 1413725834248-42db82938bef.png
  - 1200
    - 1413
      - 1413708617479-50918bc877eb.png
      - 1413725834248-42db82938bef.png

The folder structure holds three types of items:

  • Queue works as a FIFO queue containing JSON files that hold links to visit in the future. Over time it grows into a large number of small files (<10 kB). Each file is named after the timestamp at which the links were discovered, and to avoid directory pollution the files are packed into subdirectories named after the first 9 characters of the filename.
  • Visited directory contains a small JSON file for each discovered image with all available metadata, such as the URL of the actual image or its tags. The files support versioning, which became critical once I realized I needed to add more attributes after part of the data had already been created.
  • Downloaded directory contains the actual images, additionally split by image width. The filenames match those in the visited directory. A short sketch of how these paths could be built follows this list.
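
To make the layout concrete, here is a minimal sketch of how the three path families could be derived. The root location, prefix lengths, and helper names are my own assumptions for illustration, not the article's actual code.

import os

ROOT = "/mnt/ssd/crawler"  # assumed location; any fast local disk works

def queue_path(timestamp: str) -> str:
    # queue files are named after the discovery timestamp and grouped
    # into subdirectories by a prefix of that name to avoid huge folders
    return os.path.join(ROOT, "queue", timestamp[:9], f"{timestamp}.json")

def visited_path(key: str, version: int = 0) -> str:
    # one small JSON file per discovered image, optionally versioned (e.g. ".v3")
    name = f"{key}.v{version}" if version else key
    return os.path.join(ROOT, "visited", key[:4], name)

def downloaded_path(key: str, width: int, ext: str = "png") -> str:
    # actual images, split by width, reusing the visited key as the filename
    return os.path.join(ROOT, "downloaded", str(width), key[:4], f"{key}.{ext}")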

Using the filesystem as the backend for crawling doesn't mean that every storage operation involves I/O. Quite the opposite: in the actual code I cache all visited and downloaded keys in memory, which makes set operations very efficient. I/O happens mostly when new data arrives.
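
A rough sketch of such an in-memory index, under the same assumptions as above: the storage folders are scanned once at startup, after which deciding whether something is new is a pure set lookup.

import os

class StorageIndex:
    # scans the storage folders once and keeps all keys in memory
    def __init__(self, root: str):
        self.visited = self._scan(os.path.join(root, "visited"))
        self.downloaded = self._scan(os.path.join(root, "downloaded"))

    @staticmethod
    def _scan(path: str) -> set:
        keys = set()
        for _dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                keys.add(name.split(".")[0])  # drop ".vN" / ".png" suffixes
        return keys

    def is_new(self, key: str) -> bool:
        # something is worth processing only if it has never been seen before
        return key not in self.visited and key not in self.downloaded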

The download worker

Downloading images seems like a very straightforward task, and it is. It only requires knowing what should be downloaded, and the requests package can do it for you. As always, some error handling is required. In my implementation, new download requests are fetched from the incoming queue and the results are pushed back to the outgoing queue.
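
A simplified version of such a worker might look like this. I assume in-process queues carrying (key, url, width) tasks and a None sentinel to stop the loop; the real implementation may organize its tasks differently.

import requests

def download_worker(incoming, outgoing):
    # pulls (key, url, width) tasks from the incoming queue, reports to the outgoing one
    while True:
        task = incoming.get()
        if task is None:  # sentinel: no more work
            break
        key, url, width = task
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            outgoing.put(("downloaded", key, width, response.content))
        except requests.RequestException as error:
            # report the failure instead of crashing the worker
            outgoing.put(("failed", key, width, str(error)))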

The discovery worker

Just as a human can look at any web page and identify images they’d like to have, an automatic worker can do the same. On the page, we may notice additional details such as an image description or its tags. We might also spot links leading us to another web page filled with more images. Given a starting point, this automated worker can endlessly browse the internet, or a single website, for valuable resources. On each page, it can extract potential images to download, gather their metadata, and identify links worth investigating in the future.

The automation process can either employ the pure HTTP protocol to fetch a website’s content, or it can mimic human behavior using the Selenium package to appear as a genuine user. Utilizing Selenium offers additional advantages, such as maintaining valid cookies or executing scripts to construct the final page. In my journey, I chose to use Selenium. Past experiences showed me that using plain requests often led to temporary blocks on my IP address.

The implementation resembles the downloading part: the incoming queue delivers pages to look at, and the outgoing queue accepts the results.
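
To illustrate, here is a stripped-down discovery worker. The queue protocol and the notion of what counts as a valuable element are simplified assumptions; real pages usually need more targeted selectors than grabbing every img and a tag.

from selenium import webdriver
from selenium.webdriver.common.by import By

def discovery_worker(incoming, outgoing):
    driver = webdriver.Firefox()  # any browser supported by Selenium works
    try:
        while True:
            url = incoming.get()
            if url is None:  # sentinel: no more work
                break
            driver.get(url)
            images = [img.get_attribute("src")
                      for img in driver.find_elements(By.TAG_NAME, "img")]
            links = [a.get_attribute("href")
                     for a in driver.find_elements(By.TAG_NAME, "a")]
            outgoing.put(("discovered", url, images, links))
    finally:
        driver.quit()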

The coordinator

The last part of the system is the coordinator. It knows how to talk to the storage component and how to hand out work to the discovery and download workers. It maintains one queue per worker type, plus one queue where the workers send back their results. Essentially, it’s just a loop that dispatches tasks and waits for data.
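
A skeletal version of that loop, reusing the workers and storage helpers sketched earlier, might look as follows. The key_from_url convention and the fixed 1200 px width are illustrative assumptions, and the real coordinator also persists the queue and visited files, which I omit here.

import os
from queue import Queue
from threading import Thread

def key_from_url(url: str) -> str:
    # assumed convention: the last path segment (without query) identifies the image
    return url.rstrip("/").split("/")[-1].split("?")[0]

def coordinate(storage, start_url):
    to_discover, to_download, results = Queue(), Queue(), Queue()
    Thread(target=discovery_worker, args=(to_discover, results), daemon=True).start()
    Thread(target=download_worker, args=(to_download, results), daemon=True).start()

    seen_pages = {start_url}  # pages already queued for discovery
    to_discover.put(start_url)
    while True:
        kind, *payload = results.get()
        if kind == "discovered":
            _page, images, links = payload
            for image_url in images:
                if not image_url:
                    continue
                key = key_from_url(image_url)
                if storage.is_new(key):
                    to_download.put((key, image_url, 1200))  # width 1200 is an assumed default
            for link in links:
                if link and link not in seen_pages:
                    seen_pages.add(link)
                    to_discover.put(link)
        elif kind == "downloaded":
            key, width, content = payload
            path = downloaded_path(key, width)  # helper from the storage sketch
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as handle:
                handle.write(content)
            storage.downloaded.add(key)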

Outcome

I’ve run the script for a while on my home network, targeting just one popular website with free images. It appears that the script can successfully download over a million images. Feel free to experiment with it, but always remember the ethical considerations involved.

https://github.com/amacal/learning-python/tree/crawling-for-images
