Python Data Processing Example & Story Behind It

Processing Today

Businesses today rely on data processing to build business intelligence, reporting, and transactional information. Common tasks involve capturing raw data files from other businesses or from departments within the same business. Data processing is handled with a variety of software, ranging from commercial off-the-shelf (COTS) software to custom scripts, applications, and processes.

This example falls under the custom software and scripting aspect of business data processing. It demonstrates how to take raw data and transform it into something useful for data application testing.

On GitHub
On Python PyPI

The data processing basics are:

  1. Loading and parsing one or more data files.
  2. Combining parsed data into different output for use in other applications.
  3. Mapping data to custom classes.
  4. Producing mock data for testing purposes.
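
The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the inline sample data, the `Person` class, and the function names `load_rows` and `make_mock_records` are all assumptions made up for this example.

```python
import csv
import io
import random
from dataclasses import dataclass

# Hypothetical inline samples; a real run would read files from disk.
RAW_NAMES = "first_name,gender\nJames,M\nMary,F\nRobert,M\nLinda,F\n"
RAW_CITIES = "city,state,zip\nSpringfield,IL,62701\nAustin,TX,73301\n"


@dataclass
class Person:
    """Hypothetical class that one parsed name row is mapped onto."""
    first_name: str
    gender: str


def load_rows(text):
    """Step 1: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def make_mock_records(names, cities, count, seed=None):
    """Steps 2-4: map rows to a class, combine sources, emit mock records."""
    rng = random.Random(seed)                   # seed makes runs repeatable
    people = [Person(**row) for row in names]   # step 3: map rows to a class
    records = []
    for number in range(1, count + 1):
        person = rng.choice(people)
        city = rng.choice(cities)
        # Step 2: combine two parsed sources into one output record.
        records.append({
            "id": number,
            "first_name": person.first_name,
            "city": city["city"],
            "state": city["state"],
            "zip": city["zip"],
        })
    return records


names = load_rows(RAW_NAMES)
cities = load_rows(RAW_CITIES)
mock = make_mock_records(names, cities, count=5, seed=1)
for record in mock:
    print(record)
```

Passing a seed keeps the mock data repeatable, which helps when a failing test has to be reproduced later.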

Additionally, from a Python programming perspective, the code touches upon:

  1. Creating a Python console application for data processing.
  2. Handling of command line arguments in Python.
  3. Loading raw data efficiently.
  4. Configuring simple and effective application settings (one of many ways).
  5. Mapping classes with class inheritance to data (object-oriented programming).
  6. Randomizing word lists and number lists.
  7. Filtering data by field lengths using Python list comprehensions.
  8. Distributing, configuring, setting up, and publishing to PyPI.
  9. Creating different types of comma separated values (CSV) lists.
  10. Formatting concepts, such as CSV with or without headers or varying fields.
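
To make a few of these items concrete, here is a minimal sketch of a console application that handles command-line arguments and writes CSV with or without a header row, to a file or to the terminal. The option names (`--rows`, `--no-header`, `--output`) and the helper names are assumptions invented for this illustration; they are not the project's actual interface.

```python
import argparse
import csv
import sys


def build_parser():
    """Hypothetical command-line interface; the option names are assumptions."""
    parser = argparse.ArgumentParser(description="Generate mock CSV data.")
    parser.add_argument("--rows", type=int, default=3,
                        help="number of rows to write")
    parser.add_argument("--no-header", action="store_true",
                        help="omit the CSV header row")
    parser.add_argument("--output", default=None,
                        help="output file; defaults to the terminal")
    return parser


def write_csv(rows, stream, fieldnames, header=True):
    """Write dictionaries as CSV, optionally preceded by a header row."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, lineterminator="\n")
    if header:
        writer.writeheader()
    writer.writerows(rows)


def main(argv=None):
    args = build_parser().parse_args(argv)
    rows = [{"id": n, "word": f"word{n}"} for n in range(1, args.rows + 1)]
    fields = ["id", "word"]
    if args.output:
        # newline="" lets the csv module control line endings in the file.
        with open(args.output, "w", newline="") as handle:
            write_csv(rows, handle, fields, header=not args.no_header)
    else:
        write_csv(rows, sys.stdout, fields, header=not args.no_header)


# Pass the arguments explicitly so the sketch runs the same way anywhere.
main(["--rows", "2"])
```

In a real script, calling `main()` with no arguments makes `argparse` read from `sys.argv` instead.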

Other Interesting Data Points

The compiled data in this application includes U.S.A. census data such as:

  1. Common male & female first names.
  2. Common last names found in the United States of America (U.S.A.).
  3. U.S.A. Cities listing with GPS (global positioning system) data and commonly used aliases.
  4. Additional city aliases by county.
  5. U.S.P.S. (U.S. Postal Service) standardized delivery address suffixes, e.g. Alley, Allee, Aly.
  6. A large list of English words including associated metadata such as:
  • Number of characters.
  • Distinct character count.
  • Start & end character.

While several properties and metadata are not used in the application, they are available for exploration. Other properties, such as character lengths, are used in Python list comprehensions to apply field length filtering.
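
As a rough sketch of the idea, the metadata listed above can be computed per word and then used in a list comprehension to keep only the words that fit a field. The word list, the `describe` helper, and the length limits below are assumptions chosen for illustration only.

```python
# Hypothetical sample; the real project ships a much larger word list.
words = ["alley", "data", "python", "testing", "zip"]


def describe(word):
    """Build the kind of metadata described above for a single word."""
    return {
        "word": word,
        "length": len(word),          # number of characters
        "distinct": len(set(word)),   # distinct character count
        "start": word[0],             # start character
        "end": word[-1],              # end character
    }


catalog = [describe(word) for word in words]

# Field length filtering: keep words that fit a field of 4 to 6 characters.
min_len, max_len = 4, 6
fits = [entry["word"] for entry in catalog
        if min_len <= entry["length"] <= max_len]

print(fits)  # ['alley', 'data', 'python']
```

Storing the length alongside each word means the filter never has to recompute it, which matters more as the word list grows.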

Why build and publish this example application code?

On the surface this example may seem rudimentary, and it is, up to a point. However, I assure you it is not rudimentary when you consider project deadlines delayed due to poor testing.

I wrote this code to test an application whose user interface was required to accept and parse CSV data and to support address entry. Additionally, the features were new, so no historical data existed for use in testing.

When you have been doing this work for as many years as I have, you quickly notice when requirements testing is a bottleneck. I can laugh about it now that the project is over, but here is a brief overview of what was happening.

As I listened around the room, I heard mashing of keys, grunts, and groans. Anyone who writes software knows what I mean! The rat-a-tat-tat of the keyboard as people manually typed in bogus data, with flailing fingers randomly mashing keys. Rightfully, people were annoyed, because there were no defined requirements for the data: what it was supposed to look like, how long each piece of data should be, etc. As a result, manual testing ensued, which quickly became a bottleneck because it is extremely difficult to verify bogus data randomly spewed from a keyboard. Errors were going unnoticed.

I was not the manager of this project, just one of several unfortunate souls tasked with testing software that was not ready due to missing requirements. I raised my concerns about testing this way and the impact it was having on the project. The software manager made the decision to continue, so I needed a better way to test. Whether or not I agreed didn’t matter; as an employee I had very little say or influence other than refusing to do the work. Obviously, I didn’t want to lose my job over a poor management decision, so I moved forward.

I built and published this software mostly to help other people. Sharing a reusable project with people on other projects will, hopefully, help them learn Python and give them a hand in testing. Or, perhaps, it will just offer a different perspective on how to approach a difficult situation.

What did this software accomplish?

Often, overworked or perhaps uncaring software managers forget about or disregard the people aspects of a project. Poor requirements frustrate people who care about the quality of their work because final results are, at best, questionable. The questionable nature of working this way affects people differently. Some people rush through work, others drag their feet until outcomes become clearer; worse yet, people stop caring or quit. This has a huge impact on project time-lines and final product quality.

This application relieved some stress, reduced frustration, and helped people focus, because mashing keys to produce bogus data was one less sore point in a difficult situation. People were appreciative because the software quickly exposed quality issues, making it easier to adjust work-flow or raise issues to management. For example, the extremely minimal data validation present was quickly exposed as remarkably weak and useless.

Technical achievements were as follows:

  • Readable data made tracking and validating points within the work-flow possible.
  • Variable content made it easier to test length, formatting, and duplication concerns.
  • Sequential data made it possible to spot truncated or missing data.
  • Mock addresses with real name, city, state, and zip data demonstrated which fields needed additional coding.
  • Variable field lengths made it possible to validate CSV parsing of individual data components for length and content.
  • Configurable settings made it easier to do negative testing.
  • Dual output to file or terminal window made it easier to view, tweak, and adjust to changing test conditions.
  • File output to CSV made it possible to generate large data-sets useful for database insert testing and query validation.
  • The ability to produce large data-sets was useful in memory, storage, and performance testing.
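
Two of these points, sequential data and variable field lengths, lend themselves to a short sketch. The helpers below are hypothetical and not part of the published project. They assume every generated row carries a sequential `id` starting at 1, so rows lost by the application under test show up as gaps, and overlong fields can be flagged against an assumed limit.

```python
def find_missing_ids(rows, expected_count):
    """Return the sequential ids that never arrived (assumes ids start at 1)."""
    seen = {row["id"] for row in rows}
    return [n for n in range(1, expected_count + 1) if n not in seen]


def find_overlong(rows, field, max_len):
    """Return ids of rows whose field exceeds the assumed length limit."""
    return [row["id"] for row in rows if len(row[field]) > max_len]


# Mock data as generated, with sequential ids.
cities = ["Austin", "Springfield", "Reno", "Boston"]
generated = [{"id": n, "city": city} for n, city in enumerate(cities, start=1)]

# Simulate what came back from the application under test: row 3 was lost.
received = [generated[0], generated[1], generated[3]]

print(find_missing_ids(received, expected_count=4))   # [3]
print(find_overlong(received, "city", max_len=8))      # [2]
```

Lowering `max_len` below what the application accepts is one simple way to drive negative tests from the same data.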

Hey, wait a minute! Didn’t you say there were little to no requirements?

Yes. Despite the lack of requirements, my many years of experience helped establish a baseline of missing requirements. Once light was shed on the quantity of missing requirements, the business manager was able to pause and redirect the project with minimal impact on the project schedule. The software manager wasn’t happy to have project time-lines moved, but was appreciative of having information for the business manager to remediate. My team members were happy the project would come back with well-defined requirements, and a handy reusable tool, ready for retesting, was a time-saving bonus.

An Aside

I created this project on my own time and did not use any of my employer's time to build this software. My employer benefited from this work because I was able to use it with positive impact and at no expense to my employer.

I enjoy helping people, so I work on projects in my spare time that may be helpful to others. When I can, I share these projects as open source to benefit others. If you learn from, like, or use this project, please give me a shout-out to let me know. Feedback, positive or negative, is welcome.

Where to Get This Software

On GitHub
On Python PyPI