SQLite and Python 3.6 are fast enough for me on many projects. However, I was curious just how quickly I could process and insert a large number of records into a SQLite database using Python.
So let's start with the results and follow up with how I did it.
Results
Input and Output File Sizes
Text Input File: 1.3 GB
SQLite Output File: 1.06 GB
Time Breakdown
File Iteration Time: 206 seconds
SQLite Insert Time: 52 seconds
Total Time: 258 seconds
Number of Records: 3,573,026 (three million, five hundred seventy-three thousand, and twenty-six)
I consider the total time acceptable for a data set of this size, and the SQLite insert time for 3.5 million records is on par with what I have seen from other databases on the same workstation.
As expected, the largest share of processing time went to opening and iterating over the file. I found the iteration time acceptable for my project's use case.
Technical Details
I used the following consumer, retail-boxed desktop system, as it was what I happened to have available. You will notice it is quite old now: Dell originally shipped it in May of 2008, and I picked it up from a well-known big-box store while it was still new.
My point is that this is not a server-class machine. It is an old Windows machine I converted to Debian Linux (Jessie).
Computer System
Type: Dell XPS 420
OS: Debian 8 Jessie (x86-64) [latest updates applied]
Linux Kernel: 3.16.0-4-amd64
Processor: Intel Core 2 Quad CPU Q6600 @ 2.4 GHz x 4 [1066 MHz FSB, 8 MB cache]
Installed Memory: 6 GB
Video Card: ATI Radeon HD 3450
Desktop: Cinnamon
Software System
Python 3.6 / GCC 4.9.2 running in a virtualenv
SQLite – Version included with Python 3.6
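If you want to confirm which SQLite version your own Python build bundles, a quick check like the following works; the exact output will vary by installation:

import sqlite3

# Version of the SQLite library bundled with this Python build
print(sqlite3.sqlite_version)
# Version of the sqlite3 DB-API module itself
print(sqlite3.version)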
Time Measurement
I used Python's "timeit" functions to measure wall-clock time, not processor time.
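For reference, here is a minimal sketch of the kind of wall-clock timing I mean; the exact timing code I used is not shown in the listing further below:

import timeit

start = timeit.default_timer()  # wall-clock timer, not CPU time
# ... work to be measured goes here ...
elapsed = timeit.default_timer() - start
print("Elapsed seconds:", elapsed)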
Data Information
The data used for testing (the input text file) was JSON-formatted strings, one line per record, that I parsed from a DMOZ data dump file. I used the Python json library to read each line as a JSON object and pick up the column values for the insert.
The SQLite database was created as an empty database with five (5) columns of type TEXT, no indexes, and no length restrictions. As a point of reference, the five columns I used from the DMOZ data file were: Title, Description, Priority, Topic, and URL.
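For reference, a table matching that description could be created with something like the following; the column names come from the list above and the table name from the insert code further down, but the exact DDL I used may have differed slightly:

import sqlite3

conn = sqlite3.connect("content_test.db")
# Five TEXT columns, no indexes, no length restrictions
conn.execute("""
    CREATE TABLE IF NOT EXISTS MainContent (
        Title       TEXT,
        Description TEXT,
        Priority    TEXT,
        Topic       TEXT,
        URL         TEXT
    )
""")
conn.commit()
conn.close()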
Testing Comments
This test is by no means a benchmark and should not be considered one. I simply ran enough tests to get an estimate of the processing time needed for a large record set. As with all estimates, performance will vary, especially on a non-server desktop system, depending on what else is running at the time of testing.
My goal was simply to answer: "Can I use Python 3.6 and SQLite, in a reasonable way and on old equipment, to accomplish my specific use case?" For me, the answer was yes.
Python 3.6 Code I Used
The code below opens the file, iterates over it, converts each line to a JSON object, stuffs the JSON object values into a tuple, and appends the resulting tuple to a list. The list is then passed to the cursor's "executemany" method to batch-insert the records into an existing, empty SQLite database with matching columns configured.
import json
import sqlite3
import timeit  # for simplification I left out how I implemented timing for this code

print("Started Processing")

# Read the JSON-lines file and build a list of tuples for the batch insert
with open("content.json") as content_file:
    records = []
    for line in content_file:
        jo = json.loads(line)
        record = (jo.get('d:Title'),
                  jo.get('d:Description'),
                  jo.get('priority'),
                  jo.get('topic'),
                  jo.get('url'))
        records.append(record)

# Batch-insert the records into the existing, empty MainContent table
try:
    conn = sqlite3.connect("content_test.db")
    c = conn.cursor()
    c.executemany("INSERT INTO MainContent VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()
    conn.close()
except:
    # TODO: logging here
    raise

print("Finished Processing")
Additional Notes
While the processing and insert time was reasonable, I found querying the resulting SQLite database slow without indexes, as should be expected of a data set with millions of records. Even so, the query times without indexes were acceptable to me, but to make good use of the data, indexes are generally a good idea. My test system is old, and the longest part of the query process for this data set was opening the large database file; once the file was open, record retrieval was very fast.
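If you do add indexes, a single statement per column of interest is enough. The example below is hypothetical (the index name and the choice of the Topic column are mine, not from my test); index whatever columns your queries actually filter on:

import sqlite3

conn = sqlite3.connect("content_test.db")
# Hypothetical example: index the Topic column to speed up topic lookups
conn.execute("CREATE INDEX IF NOT EXISTS idx_maincontent_topic ON MainContent (Topic)")
conn.commit()
conn.close()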
Overall, I'm very happy with the performance of Python 3.6 and SQLite on my old machine. I am sure different approaches to gathering the values for insert would improve performance further; one such variation is sketched below.
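For example, instead of building the entire list in memory first, the records could be inserted in batches while iterating over the file. This is a rough sketch only (the batch size is arbitrary) and I have not benchmarked it against the approach above:

import json
import sqlite3

BATCH_SIZE = 50000  # arbitrary batch size

conn = sqlite3.connect("content_test.db")
c = conn.cursor()

with open("content.json") as content_file:
    batch = []
    for line in content_file:
        jo = json.loads(line)
        batch.append((jo.get('d:Title'), jo.get('d:Description'),
                      jo.get('priority'), jo.get('topic'), jo.get('url')))
        if len(batch) >= BATCH_SIZE:
            c.executemany("INSERT INTO MainContent VALUES (?, ?, ?, ?, ?)", batch)
            batch.clear()
    if batch:  # insert any remaining records
        c.executemany("INSERT INTO MainContent VALUES (?, ?, ?, ?, ?)", batch)

conn.commit()
conn.close()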
DMOZ is the organization that produced an internet directory; it was previously found at dmoz.org and is now found at curlie.org.