Fast Database Inserts with Python 3.6 and SQLite

SQLite and Python 3.6 are fast enough for me on many projects. However, I was curious just how quickly I could process and insert a large number of records into a SQLite database using Python.

So let's start with the results and follow up with how I did it.

Results

Input and Output File Sizes

Text Input File:     1.3 GB
SQLite Output File:  1.06 GB

Time Breakdown

File Iteration Time:  206 seconds
SQLite Insert Time:   52 seconds
Total Time:           258 seconds
Number of Records:    3,573,026 (three million, five hundred seventy-three thousand, twenty-six)

I consider the total time acceptable for a data set of this size, and the SQLite insert time for 3.5 million records is on par with other databases I have used on the same workstation.
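
For rough context, those figures translate into the following approximate rates (simple division of the numbers above):

3,573,026 records / 52 seconds  ≈ 68,700 inserts per second (SQLite insert step only)
3,573,026 records / 258 seconds ≈ 13,850 records per second (end to end, including file iteration)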

As expected, the largest share of the processing time went to opening and iterating over the input file. I found the iteration time acceptable for my project's use case.

Technical Details

The following consumer, retail-boxed desktop system was used, as it was what I happened to have available. You will notice the system is quite old now; Dell originally shipped it in May of 2008, and I picked it up from a well-known big-box store while it was still new.

My point is: this is not a server-class machine. It is an old Windows machine I converted to Debian Linux (Jessie).

Computer System

Type:              Dell XPS 420
OS:                Debian 8 Jessie (x86-64) [latest updates applied]
Linux Kernel:      3.16.0-4-amd64
Processor:         Intel Core 2 Quad CPU Q6600 @ 2.4 GHz x 4 [1066 FSB], 8 MB cache
Installed Memory:  6 GB
Video Card:        ATI Radeon HD 3450
Desktop:           Cinnamon

Software System

Python 3.6 (built with GCC 4.9.2), running in a virtualenv
SQLite – the version bundled with Python 3.6
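
As an aside, if you want to confirm which SQLite library your Python build ships with, the sqlite3 module exposes it directly (a quick check, not part of the test itself):

import sqlite3

print(sqlite3.version)          # version of the sqlite3 (pysqlite) module itself
print(sqlite3.sqlite_version)   # version of the underlying SQLite library bundled with Python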

Time Measurement

I used the Python “timeit” functions to measure wall-clock time, not processor time.
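
I left the timing code out of the listing below, but a minimal sketch of one way to take wall-clock measurements with timeit's default timer (which wraps time.perf_counter) looks like this:

import timeit

start = timeit.default_timer()      # wall-clock (monotonic) timer, not CPU time
# ... file iteration or insert work goes here ...
elapsed = timeit.default_timer() - start
print("Elapsed: {:.1f} seconds".format(elapsed))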

Data Information

The data used for testing (the input text file) was a set of JSON-formatted strings, one record per line, that I parsed from a DMOZ data dump file. I used the Python json library to read each line as a JSON object and pick up the column values for insert.

The SQLite database was created as an empty database with five (5) columns of type TEXT with NO indexes and no length restrictions. As a point of reference, the five columns I used from the DMOZ data file were: Title, Description, Priority, Topic, and URL.
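
I have not included the exact DDL here, but a minimal sketch of a matching schema, using the MainContent table name from the insert code below and the five TEXT columns described above, would look something like this:

import sqlite3

conn = sqlite3.connect("content_test.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS MainContent (
        Title       TEXT,
        Description TEXT,
        Priority    TEXT,
        Topic       TEXT,
        URL         TEXT
    )
""")
conn.commit()
conn.close()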

Testing Comments

This test is by no means a benchmark and should not be considered one. I simply ran enough tests to get an estimate of the processing time needed for a large record set. As with all estimates, performance will vary, especially on a desktop (non-server) system, depending on what other applications are running at the time of testing.

My goal was simply: “Can I use Python 3.6 and SQLite, in a reasonable way (on old equipment), to accomplish my specific use case?” For me, the answer was yes.

Python 3.6 Code I Used

The code below opens a file, iterates over it, converts each line to a JSON object, stuffs the JSON object's values into a tuple, and appends the resulting tuple to a list. The list is then passed to the SQLite “executemany” cursor method to batch-insert into an existing, empty SQLite database with matching columns configured.

import json
import sqlite3
import timeit  # for simplification I left out how I implemented timing for this code

print("Started Processing")
with open("content.json") as content_file:
    # Build one tuple per line; the JSON keys map to the five TEXT columns.
    records = []
    for line in content_file:
        jo = json.loads(line)
        record = (jo.get('d:Title'), jo.get('d:Description'),
                  jo.get('priority'), jo.get('topic'), jo.get('url'))
        records.append(record)

    try:
        conn = sqlite3.connect("content_test.db")
        c = conn.cursor()
        # Batch-insert the whole list in a single transaction.
        c.executemany("INSERT INTO MainContent VALUES (?, ?, ?, ?, ?)", records)
        conn.commit()
        conn.close()
    except Exception:
        # TODO: logging here
        raise

print("Finished Processing")

Additional Notes

While the processing and insert time for the data was reasonable, I found querying the resulting SQLite database noticeably slower without indexes, as should be expected of a data set with millions of records. Even so, the query times were acceptable to me, but to make good use of the data, indexes are generally a good idea. My test system is old, and the longest part of the query process for this data set was opening the large database file; once the file was open, record retrieval was very fast.
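
For illustration only, adding indexes after the bulk load could look something like the following; which columns to index depends entirely on your query patterns (Topic and URL here are just examples, not the indexes I settled on):

import sqlite3

conn = sqlite3.connect("content_test.db")
# Hypothetical indexes; choose columns based on your actual queries.
conn.execute("CREATE INDEX IF NOT EXISTS idx_maincontent_topic ON MainContent (Topic)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_maincontent_url ON MainContent (URL)")
conn.commit()
conn.close()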

Overall, I’m very happy with the performance of Python 3.6 and SQLite on my old machine. I am sure different approaches to gathering the values for insert would improve performance further.
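
One such variation (a sketch only, not something I benchmarked) is to feed executemany a generator instead of building the full list of 3.5 million tuples in memory first:

import json
import sqlite3

def iter_records(path):
    # Yield one tuple per line instead of holding every record in a list.
    with open(path) as content_file:
        for line in content_file:
            jo = json.loads(line)
            yield (jo.get('d:Title'), jo.get('d:Description'),
                   jo.get('priority'), jo.get('topic'), jo.get('url'))

conn = sqlite3.connect("content_test.db")
with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO MainContent VALUES (?, ?, ?, ?, ?)",
                     iter_records("content.json"))
conn.close()

Whether this is actually faster is an open question; it mainly trades memory use for a slightly different access pattern.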

DMOZ is the organization that produced an internet directory; it was previously found at dmoz.org and is now found at curlie.org.
