This article shows how to generate large files using Python.

1. The environment

  • Python 2.7.10

2. The targets

There are four targets in this post:

  • generate a big binary file filled with random bytes
  • generate a big text file filled with random alphabetic letters
  • generate a big empty/sparse file
  • generate a big text file filled with lines of random strings (sentences)

2.1 generate a big binary file filled with random bytes

def generate_big_random_bin_file(filename, size):
    """
    generate a big binary file with the specified size in bytes
    :param filename: the filename
    :param size: the size in bytes
    :return: None
    """
    import os
    with open(filename, 'wb') as fout:
        fout.write(os.urandom(size))  #1

    # size is an integer byte count, so format it with %d rather than %f
    print('big random binary file with size %d bytes generated ok' % size)

Line #1 uses the os.urandom function to generate the random bytes. Here is the documentation of this function:

os.urandom(n)

Return a string of n random bytes suitable for cryptographic use.

This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom(). If a randomness source is not found, NotImplementedError will be raised.

Run it:

if __name__ == '__main__':
    generate_big_random_bin_file("temp_big_bin.dat",1024*1024)

and we got this result:

-rw-r--r--  1 zzz  staff   1.0M Apr 25 22:04 temp_big_bin.dat
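
Because os.urandom(size) builds the whole payload in memory before it is written, a very large size can use a lot of RAM. The following is a minimal sketch of a chunked variant, not a definitive implementation. The name generate_big_random_bin_file_chunked and the 1 MiB default for chunk_size are assumptions made for illustration. The code should run on both Python 2.7 and Python 3.

```python
import os


def generate_big_random_bin_file_chunked(filename, size, chunk_size=1024 * 1024):
    """Write `size` random bytes to `filename`, at most `chunk_size` at a time.

    Hypothetical helper: the name and the default chunk size are assumptions.
    """
    remaining = size
    with open(filename, 'wb') as fout:
        while remaining > 0:
            # never request more bytes than are still missing
            n = min(chunk_size, remaining)
            fout.write(os.urandom(n))
            remaining -= n
```

With this variant, memory use is bounded by chunk_size instead of by the size of the file.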

2.2 generate a big text file filled with random alphabetic letters

def generate_big_random_letters(filename, size):
    """
    generate big random letters/alphabets to a file
    :param filename: the filename
    :param size: the size in bytes
    :return: None
    """
    import random
    import string

    # string.ascii_letters is locale-independent and also exists in Python 3
    chars = ''.join([random.choice(string.ascii_letters) for i in range(size)])  #1

    with open(filename, 'w') as f:
        f.write(chars)

The key point is the random.choice function. Here is an introduction to this function:

random.choice(seq)

Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.

Run it:

if __name__ == '__main__':
    generate_big_random_letters("temp_big_letters.txt",1024*1024)

and we got this result:

-rw-r--r--  1 zzz  staff   1.0M Apr 25 22:15 temp_big_letters.txt
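
Joining all the letters into one string keeps size characters in memory at once. Here is a rough sketch of a chunked variant. The name generate_big_random_letters_chunked and the 64 KiB default for chunk_size are assumptions for illustration. It draws from string.ascii_letters, which is available in both Python 2.7 and Python 3.

```python
import random
import string


def generate_big_random_letters_chunked(filename, size, chunk_size=64 * 1024):
    """Write `size` random ASCII letters to `filename` in chunks.

    Hypothetical helper: the name and the default chunk size are assumptions.
    """
    remaining = size
    with open(filename, 'w') as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            # build and write only one chunk of letters at a time
            f.write(''.join(random.choice(string.ascii_letters) for _ in range(n)))
            remaining -= n
```

Since every ASCII letter is a single byte, the resulting file size in bytes equals size.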

2.3 generate a big empty/sparse file

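def generate_big_sparse_file(filename, size):
    # seek to the last byte position, then write one byte there;
    # the range before it is skipped and never written
    with open(filename, "wb") as f:
        f.seek(size - 1)
        f.write(b"\1")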

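The key point is the f.seek call: it moves the file position to offset size - 1, past the end of the still-empty file, and the following write puts a single byte there. On filesystems that support sparse files, the skipped range is a hole that reads back as zeros without occupying disk blocks.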

Run it:

if __name__ == '__main__':
    generate_big_sparse_file("temp_big_sparse.dat",100)

and we got this file content:

0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0001

We can see that the byte 01 is the last byte of the file, and every byte before it reads as zero.
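
To check whether a file is really stored sparsely, one option is to compare its logical size with the space allocated on disk. This is a small sketch under assumptions: file_sizes is a hypothetical helper name, and st_blocks (documented by Python as the number of 512-byte blocks allocated) is only reported on some platforms, so the helper returns None for it elsewhere.

```python
import os


def file_sizes(filename):
    """Return (logical_size, allocated_bytes) for `filename`.

    Hypothetical helper. allocated_bytes is None on platforms where
    os.stat() does not report st_blocks (for example Windows).
    """
    st = os.stat(filename)
    blocks = getattr(st, 'st_blocks', None)
    # Python documents st_blocks as a count of 512-byte blocks
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated
```

On a filesystem that supports sparse files, the second value for a large sparse file is typically much smaller than the first. For the 100-byte example above the difference may not be visible, because the file fits in a single block.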

2.4 generate a big text file filled with lines of random strings (sentences)

def generate_big_random_sentences(filename, linecount):
    import random
    nouns = ("puppy", "car", "rabbit", "girl", "monkey")
    verbs = ("runs", "hits", "jumps", "drives", "barfs")
    adv = ("crazily.", "dutifully.", "foolishly.", "merrily.", "occasionally.")
    adj = ("adorable", "clueless", "dirty", "odd", "stupid")

    # word order of each sentence: noun, verb, adjective, adverb
    # (named word_lists so the built-in all() is not shadowed)
    word_lists = [nouns, verbs, adj, adv]

    with open(filename, 'w') as f:
        for _ in range(linecount):
            f.write(' '.join([random.choice(words) for words in word_lists]) + '\n')

The key points are:

  • set up four tuples that hold the building blocks of a sentence: nouns, verbs, adverbs and adjectives
  • collect them in a list, in the order the words should appear
  • use random.choice to pick one random word from each tuple and join them into a sentence
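
The three steps above can be sketched for a single sentence as follows. The helper name random_sentence and its rng parameter are assumptions added for illustration. Passing a seeded random.Random instance makes the output repeatable, which can help when testing.

```python
import random

NOUNS = ("puppy", "car", "rabbit", "girl", "monkey")
VERBS = ("runs", "hits", "jumps", "drives", "barfs")
ADJ = ("adorable", "clueless", "dirty", "odd", "stupid")
ADV = ("crazily.", "dutifully.", "foolishly.", "merrily.", "occasionally.")

# Word order of a sentence: noun, verb, adjective, adverb.
WORD_LISTS = [NOUNS, VERBS, ADJ, ADV]


def random_sentence(rng=random):
    """Pick one word from each tuple and join them with spaces.

    Hypothetical helper: `rng` may be the random module itself or a
    random.Random instance (seeded for repeatable output).
    """
    return ' '.join(rng.choice(words) for words in WORD_LISTS)
```

Each call returns one line of the kind written to the file, without the trailing newline.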

Run it (generate 1000 lines of sentences):

if __name__ == '__main__':
    generate_big_random_sentences("temp_big_sentences.txt",1000)

and we got this file content:

==> temp_big_sentences.txt <==
car hits stupid dutifully.
puppy runs dirty occasionally.
rabbit barfs adorable occasionally.
puppy barfs adorable occasionally.
girl jumps odd foolishly.
monkey runs adorable crazily.
girl drives stupid foolishly.
puppy drives clueless dutifully.
car hits clueless crazily.
girl barfs adorable crazily.
......

You can find detailed documentation about Python I/O here: