The Design of A Custom Database System for Storing and Accessing
Point Source Emission Test Data and Associated Process
Susan E. Gossman
Systems Analyst - Gossman Consulting, Inc.
David G. Gossman
President - Gossman Consulting, Inc.
Presented at the AWMA International Specialty Conference on Waste Combustion in Boilers and Industrial Furnaces April, 1997
BIF and incinerator facilities continue to develop large numbers of stack test reports covering multiple point sources, various emission parameters, and differing operating conditions. When corporate management or a plant manager ask, "What are our mercury emissions like?" or "Under what conditions do we see maximum particulate emissions?" a thorough answer can take many man-hours of emission report review. A custom database has been designed that can shorten this time requirement to a few minutes with a Windows based, user friendly interface. The systems based approach to designing the database structure is described and example database inquiries are illustrated.
An effective database is one which has been engineered to meet the actual present and future needs of a company. It contains accurate, vetted data which can withstand rigorous scrutiny. The interface must be usable and understandable by all persons which who will be directly accessing the database and must provide reports and exportable data which meets the needs of "arm's length" users. The database needs a mechanism to guard against corruption of existing data and to ensure that data added in the future will meet quality assurance objectives. The database must be designed and documented so that it can grow and be easily modified as the company grows or determines additional future needs.
The process of physically designing the database structure , its internal tables and the linkage of table keys, is frequently erroneously considered to be the full substance of "designing the database." Optimal design is actually a reiterative process which moves back and forth between design and implementation with real data. Be prepared to rotate repetitively between analyzing who wants what, designing the table structure, collecting and entering data and building forms, queries and reports. Quality assurance must both be a continuing process performed at every step along the way and also have adequate time set aside in the schedule for dedicated correctness testing after the database is nominally completed.
Analyzing User Needs
The most critical step in developing a database for any application is to develop a clear and accurate understanding of the needs the database is being created to fulfill. A company will have many "internal customers" of a project of this nature. Without an accurate understanding of what will make the internal customer happy, the completed database has virtually no chance of making those internal customers happy.
An internal customer may be executive personnel who, for example, may need to have data available to make project feasibility assessments, respond to regulatory challenges or assess corporate policy. Other internal customers may be administrative personnel who need "to demonstrate that ... "; "to document that ... " or to complete routine reports or government required filings. Operations and technical personnel need access to accurate, correct scientific data to perform their operational tasks and to make decisions necessary to keep the company technologically up-to-date. Information services will have needs concerning compatibility and staffing issues. Legal counsel will find direct access to all of the company's emissions data especially valuable in responding to challenges within today's regulatory environment.
Interviewing each of the affected internal customers at the beginning of the database project will dramatically improve the perception of the success of the completed project. These interviews need to be highly specific, detailing how the current process is performed, which currently existing forms and reports will be generated by the new database and which new reports need to be created. Obtain blank copies of all forms for which the database will be used, highlighting all fields which are expected to come from the database. Ask for specific instances for which the database would have been used in the past, had it existed, and for any specific instances for which their is an anticipated future use of the database.
An emissions inventory database typically addresses needs such as regulatory compliance issues, community relations programs, and process control and evaluation. It can assist lobbying efforts relative to proposed regulations and speed the efficiency of data retrieval from test results. In addition, a homogenous emissions database can make possible more accurate plant comparative studies.
Along with determining the needs the database will be designed to meet, determine the existing data sources which are intended to be included in the database. These sources may include internal test reports and/or results, currently existing administrative or operational data, and publicly published data. This data may be in printed form in test reports, publications or operational logs, in existing spreadsheets, or in other pre-existing databases.
Perform a detailed comparison of the database output needs with the data available for input. If one of the internal customers wants data to come out of the database which is not currently available to put into the database, begin the process of identifying and acquiring this "missing" data, or communicate the unavailability of this data to the internal customer. Likewise, identify data slated to be put into the database which no one has expressed a need to see come out of the database. This group of data can provide an extremely valuable clue to otherwise invisible database needs. There is a good chance someone needed this data at one time. Ferreting out why this data was originally collected early in the database design process can have some unexpected benefits in avoiding late project surprises.
Designing the Table Structure
It is beyond the scope of this paper to attempt to teach the nuts and bolts of designing database tables and linkage structures. This task is often out-sourced or may be performed by someone from the Information Services (IS) group. The database project manager, however, may find it useful to have some insights into the process. Following are some characteristics of a well designed database.
Each table must contain data of a common content consistent with the name of the table. All data within a record within a given table must remain unchanged across the time frame that the record applies to. A common example of this involves company and plant identifying data. Strangely enough, the company name does not belong in a table of plant data - the plant address will not change over the life of the record, but the company owning the plant very well might change.
Some data in the database will be used to select other data. For example, the plant State field may be used in answer to the request "Find all mercury analyses performed in Illinois." Other data may or may not ever be used in the selection process. Will anyone ever ask "Find all analyses performed by the technician with initials 'XYZ'?" Any particular item of data in the database which could be used in the selection process should only physically appear once. The database should be configured so that, for example, the same phrase such as "Raw mill off" will not appear repeatedly. It could be structured as a selection in a pre-defined look-up table, or as a "yes-no" field of its own.
The design details should be consistent throughout. The design of table linkages, or keys, should be consistent across the database. Table, field and key names should follow consistent conventions. Field definitions, such as data type and length should follow uniform conventions.
Fields should not be forced to contain multiple versions of data. If some reports provide the run start time and the total run length, and others provide the run start and run stop time, it is preferable to have three fields; run start time, run stop time, and total run length; rather than try to somehow combine the run stop time and the run length.
A text field for comments can appear in a judicious selection of tables. There is a fine line here because it is important that information not be routinely placed in a comment field. At the same time, there are occasional instances when there is a unique and legitimate reason for placing a comment in a record.
Implementing the Database Design
During the development of the database, attempt to keep one's options open. At times, decisions must be made as to whether to handle data one way or another way. When the "right" choice is insufficiently clear, retaining sufficient information to commit to the choice at a later time is advantageous. For example, assessing whether to keep one table intact or to split it into two tables may have nearly balanced pros and cons. In that case, try leaving it the way it is (the route of least effort), make a note of the question, and defer answering the question until either one has enough new information to make a clear decision or the decision has become the highest priority next task to accomplish.
It often takes less time and money in the long run to give oneself permission to not make every design decision perfectly the first time. Plan to evaluate for design "mistakes" and to approach the design in a reiterative fashion. Retain fields which permit easy restructuring of tables. Practice querying techniques which facilitate table restructuring. These tips can be highly effective in avoiding bogging down database development.
After the database is completed, it can only be verified to be correct if there is some means of relating the resultant reports with the original data. Because errors can be introduced at every phase, from sample collection, to laboratory analysis, to typographical error, to database table mis-linking, it is critical to maintain tracking records. These records must permit tracing every data point in the database directly back to the most original source available, ideally even back to original handwritten field records.
Collecting and Entering the Data
The correctness of the final product will depend in large part on the caliber of the person(s) assigned to collect the data. This person must be able to assess the quality and correctness of the laboratory data. They must understand the context of the data and be able to separate the important from the unimportant. As discussed under the topic of designing database tables, data can crop up during the data collection phase of which the database designers were unaware. The person performing data collection must exercise good judgement in handling these discrepancies.
This person must determine the test or report types which contain data which will be included in the database. The location of complete copies of these reports must be ascertained. (Plant copies often are only partial copies. External laboratory QA/QC documentation especially is often not available in plant copies of reports.)
Create a master list of reports, including at a minimum the facility name, the date of the test, where the original report is located and what major types of data they contain: metals, dioxins, organics, etc. When collecting the data, it is desirable to photocopy the relevant pages, label them so that they can be traced to the original report and create a file which will be retained as part of quality assurance. Highlight desired data on and then perform keyboard or OCR entry from these photocopies. After data entry, printouts of the entered data (used for proofing) can be filed with the photocopies. A clearly labeled manila file should then exist for each item on the master list of reports.
Data entry into the computer and placing the data into the database are two distinct tasks. The two primary options for data entry are typing (somewhat freeform) into a spreadsheet or typing directly into the database. If data is entered into a spreadsheet, it must then be verified for typographical correctness and then be imported into the database. If data is entered directly into the database, it must still be verified for typographical correctness. Either way, a printout should be made containing only the keyed-in data for each individual test report.
While it is tempting to imagine that it is more efficient to enter the data directly into the database, this is usually not the case. The initial database structure typically is created from pristine, blank copies of what the test reports are supposed to contain. The actual test reports will have notations in the margins, supplemental tables added, multiple values reported in single fields and a host of other unanticipated variations. This presents a bit of a chicken and egg conundrum. Go through all the test reports in detail before beginning the database design or design the database before going through all the test reports in detail?
A workable balance is to perform a rough database design from what the reports are anticipated to look like. Select a few sample test reports to use in this process. Include a report which is thought to be typical, one which is as complex or difficult as possible and one which is minimal. Use these to determine which data to definitely include and which data to definitely exclude on an initial pass. The person collecting the data can then more easily photocopy relevant pages, flagging those pages which contain data which might otherwise fall through the cracks - the data which is not definitely to be included or excluded. Following data collection (or perhaps intermittently during collection) the design process can be revisited, taking into account this newly discovered information.
In the same manner, effective progress in data entry can be more readily accomplished by entering data into spreadsheets. This permits the data entry personnel some flexibility while the database is still taking shape. In essence, during the early development phase, it is more preferable to "fit the database to the data" instead of "fitting the data to the database."
As data is entered, retain with the data sufficient information to be able to trace backwards to the source of the data. Admittedly, in many circumstances, knowing the plant name, the report type and the date of the test can get the user to within a dozen photocopy sheets. Thumbing through these dozen sheets is not problematic. In other instances, when the plant name, report type and date gets one to several hundred pages, record the page number in the database. If data comes in from large spreadsheets, hundreds or thousands of lines long, add a column to the spreadsheet containing a unique value for each line and record that value in the database. Choose and record sufficient information so that one can reliably locate the piece of paper or the spreadsheet cell for every data point recorded in the database.
Data is often summarized in test reports. Generally, it is desirable to recreate the summary when querying the database rather than duplicate the test report data (once as the individual data point and then again as part of a sum or average.) Nevertheless, it can be valuable to the quality assurance process to initially include summarized data in the database. At the end of the quality assurance process, these fields can always be removed or hidden if management wants to ensure that only original, non-summarized data is used in any queries.
The Question of Units
Data is reported in a variety of units, bases and corrections. Emissions data, especially, can be reported in both a concentration and a mass flow basis and may or may not be corrected for moisture, temperature, barometric pressure or stack gas concentrations, most notably oxygen. While it may be tempting to convert to common units on the fly, it is important to enter these data values in their originally reported units. Later conversion, if any, must be recorded in some fashion, preferably by recording the formula in a spreadsheet or in the database. Avoid making a conversion with a calculator and then entering only the resultant number into the computer. This would combine at least three sources of error, typographical keystroke error, use of the wrong formula and loss of precision from the use of inconsistent number of significant digits in conversion constants.
The original data values and the original units should be recorded, along with all data used in corrections and conversions of basis. For emissions data this would include stack moisture, stack temperature, stack gas flow rate on a dry basis and stack oxygen concentration. Determine whether to record data such as actual stack gas flow rate and stack carbon dioxide concentrations. If the units on this supporting data are consistent, it is reasonable to include the unit in the label of the field, i.e., "Stack Temp in °F". Be prepared, however. Somewhere, someone will have recorded stack temperature in Renkines. If the decision is made to make a rare conversion such as this on the fly, still record the original value, the original units and the formula used for conversion in a comment field.
Building Forms, Queries and Reports
Initial reports created with the database serve a number of functions. Some reports will have been defined during the initial analysis of "who wants what." Reports should also be created to reconstruct the format of the original data to provide an additional opportunity to perform quality assurance. The creation of some speculative queries and reports to demonstrate how data can be extracted and utilized is both a valuable training tool and can help some company personnel visualize how they could use the database.
To reduce the opportunity for later error, pre-build sample or template queries to perform the most commonly used unit, basis and correction conversions. Conversion constants can be stored in a table of units. Store sufficient significant digits in conversion constants so that the conversion itself will not unduly limit the precision of the result.
During the initial analysis phase, the question of who would be accessing the database was explored. These personnel's needs will direct in large part whether or not a large repertoire of predefined reports should be created. Some usages will involve training personnel to make their own queries and reports. Other usages will require designing reports for very specific purposes but will limit training only to how to run particular reports each month, or each year.
Quality Assurance - Again
After the database is nominally completed, it is necessary to devote time to assessing the accuracy of the database. One may imagine that the database has a "confidence factor." That confidence factor is only as great as actual testing of the database supports. Each of the following activities will increase the confidence factor.
Make a count of the number of records in each table. Are they reasonable? Are the proportions of the counts reasonable, i.e., six times as many reports as plants, 200 times as many metal emission measurements as reports, roughly twice as many kilns as plants, etc.?
Compare individual tables to other compilations of industry data. For example, print out all of the plant data and compare it to a separate listing of plant data. A listing of all the reports which were included can be circulated among plant managers to verify no major reports were overlooked.
Check linkages between tables. List, for example, all plants with reported dioxin values, or with metal emission values. Check these against the master list of reports. Or list the minimum and maximum stack gas flow rates of each stack to verify they are appropriate. These are samples of queries which stretch across several tables. Make as many of these queries as necessary to test every table linkage.
Assess the numerical accuracy of the database. Create sums or averages of data where possible to compare against sums which were in the original reports. For example, dioxin values may have been summed in original test reports. Many emission values may have been averaged across a group of runs.
Check tables for internal consistency. Determine the range of values present in each field and verify that they make sense. Determine what proportion of the values in a given field are missing and assess if that amount is reasonable.
Endeavor wherever possible to assess the data in ways that make the data feel meaningful. Go right into tables and scan the data. Even if a table contains tens of thousands of records, develop a feel for that data. Sort it, check minimums, maximums, averages.
Trace individual data points back to original reports. Use the backward links which were built in as the database was constructed to trace any suspicious findings back to their source. Select a record from each table and trace that record back to paper, proofing every field in the record.
Maintenance of the Database
Once the database has been initially created, future maintenance will come in one of two forms. First, occasional corrections of individual data points may occur and, secondly, new test reports will become available which may need to be added. During the original design of the database, it is appropriate to assess how these later corrections or additions will be handled.
The primary consideration is maintaining the level of accuracy of the data. Once the data has been verified to be correct, at least to some level of confidence, it is important that not just anyone can "fix" something. It is important that any "fix" be carefully verified to be sure that it is, indeed, warranted, and not just the result of a misunderstanding on the part of the operator as to the meaning of the data. In addition, the database is more defensible if all persons able to modify the database are identified. It may be desirable to include audit and security functions to insure data integrity.
Corrections and major additions should be made separately from the master database and undergo a predetermined scrutiny to verify their accuracy. Major additions, especially, should be made in a separate, identically structured database and be completely verified before being added to the existing master database. Backups, obviously, need to be performed and retained of the original database before additions are performed.
A well designed, carefully created emissions database can be an invaluable aide in making business decisions. Facilities today have begun to amass such an amount of emission data in far-flung uncoordinated reports that it is no longer feasible to attempt to thumb through existing reports to extract data on a piece-meal basis. The increase of efficiency in accessing existing emissions data can make the creation of a database a significant cost-saving decision. Compiling a central core of data, all of which has undergone common extraction, assessment and homogenization, makes this expensive original data available to decision-makers in a significantly more usable form.