Loading a Large Amount of Data

Before You Unzip Your Archive...

Storage space on Turing is limited. While we have a lot, it fills up quickly, which can negatively impact your project's performance. Large zip files often contain millions of smaller files. Extracting them to Turing's file system puts serious strain on the file server and creates communication bottlenecks in your code. If you can avoid unzipping your files, we encourage you to leave them zipped.

Keeping it Zipped

Most programming languages have easy-to-use utilities for reading data directly from a zip file. For example, to read every file in an archive in Python:

import zipfile

# Open the archive and read each member directly, without extracting to disk.
with zipfile.ZipFile('training_data.zip') as myzip:
    for name in myzip.namelist():
        with myzip.open(name) as myfile:
            print(myfile.read())
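
Calling read() loads each member fully into memory, which may be more than you want for large data files. As a minimal sketch, assuming the archive holds UTF-8 text files such as CSVs, the members can instead be streamed line by line. The function name iter_zip_lines and its suffix and encoding parameters are hypothetical, chosen only for illustration:

```python
import io
import zipfile


def iter_zip_lines(zip_path, suffix=".csv", encoding="utf-8"):
    """Yield (member_name, line) pairs without extracting the archive.

    The suffix filter and the UTF-8 default are assumptions for this sketch.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and members with a different extension.
            if info.is_dir() or not info.filename.endswith(suffix):
                continue
            with archive.open(info) as raw:
                # archive.open() returns a binary stream; wrap it so the
                # text is decoded lazily, one line at a time.
                for line in io.TextIOWrapper(raw, encoding=encoding):
                    yield info.filename, line.rstrip("\n")
```

It could be used as `for name, line in iter_zip_lines('training_data.zip'): ...`, so only one line is decoded at a time and nothing is written to the file system.
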
For languages that do not have built-in support for reading zip file data, you can use the provided modules on the cluster to load libraries that provide this functionality.