Computer Science II

Categories:

Recommended

Relational Databases

3.1 Introduction

Consider the data in Table 3.1 for a course enrollment system.

The data in this table is essentially a flat file. Though we’ve presented it in a nicely formatted table, this data is essentially no different from a Comma-Separated Value (CSV) file in which each line represents a record and individual column values are delimited with a comma. The same data as a CSV file is presented in Figure 3.1. Flat file representations of data are called such because they take related pieces of data, such as which courses a student is enrolled, and flatten them into a single table representation. Even ancillary data that have a completely different relationship (such as emails) is included in the “flattened” representation.

Flat file representations can lead to a lot of problems, limitations and can lead to data anomalies. To motivate the use of a proper database system, we identify several of these here.

  • There is no semantic meaning to the data fields. Though the first row contains headers of what each column represents, there is no way to enforce those associations. Since this data format is merely a plaintext file, we could accidentally swap columns, omit columns or provide more tokens than expected. Alternatively, some data representation formats are designed for Electronic Data Interchange (EDI). In order to transfer data between two different systems (that may be on different hardware, use different data models or are written in different languages) data needs to be translated (a process called serialization) into a universal format in which each data field is semantically marked up to indicate what it represents. Two examples can be found in Figure 3.2 which contains samples of the same enrollment data represented as JavaScript Object Notation (JSON) and Extensible Markup Language (XML) respectively.

    Figure 3.1: CSV Formatted Data

  • There is a lot of repetition of data. Every record that models a student enrolled in a course repeats all the student’s data including their name, ID, email, etc. Likewise, the course information is also repeated over every enrollment record.
  • Repeated data can lead to different and conflicting representations and for- matting problems. In some of the student records the name is similar, but represented differently, some are last name first, others are first name first, others have a middle initial or only have an initial for the first name, one doesn’t use capitalization, etc.
  • There is missing or incomplete data. The first and last record do not have any course data and they represent this missing data inconsistently: with empty string values in the first record, a placeholder value (“NONE”), and a null value. These may be records that once held data but that was changed. However, so as not to lose the (still relevant) email data, they were modified.
  • There is inconsistent data. The third from last record for example has a different NUID for Tom Waits as it does from the other records with a similar name. This NUID is also associated with Lou Reed in the final two records. Without any rules to enforce consistency, there is no data integrity and incorrect data like this is allowed to occur.
  • There are organizational and efficiency issues with processing a single data file. Any aggregate reports, such as producing a roster for a particular class or a schedule for a single student, would require processing every record in the entire file. Similarly, updating a single record becomes extremely difficult. For example, if one student change their email, we have to find and replace every instance with several contingencies (dealing with multiple email records) and side effects to deal with.
  • Keeping data in a single file also has concurrency Issues. If we want to design a system to be multi-user or even simply multithreaded or have several programs access and process our data then each process or thread may have to do so by placing a file lock on the data file. This precludes the possibility of concurrent and parallel processing, severely limiting a system’s efficiency and essentially making it a single user system.

The solution to (most) of these problems is to use a proper Relational Database Management System (RDMS). A relational database stores data in tables. Tables delineate data in columns and rows. Each column holds a specific type of data (integers, strings, etc.) that define fields of records. Each row in a table corresponds to a single record.

Category:

Attribution

Dr. Chris Bourke (2019), Computer Science II, URL: https://cse.unl.edu/~cbourke/ComputerScienceTwo.pdf

This work is licensed under Attribution-ShareAlike 4.0 International License  (https://creativecommons.org/licenses/by-sa/4.0/).

VP Flipbook Maker

Display and share your work with others by VP Online flipbook maker! It is a professional tool for flipbook converting and creating. Try it now!