
Cleaning techniques

Typically, the data cleansing process revolves around identifying outliers: data points that stand out because they do not follow the pattern that the data scientist sees in the data or is interested in.

Data scientists use various techniques to identify outliers in the data. One approach is to plot the data points and then visually inspect the resulting plot for points that lie far outside the general distribution. Another is to programmatically eliminate all points that fall outside the data scientist's mathematical control limits (limits based upon the objective or intention of the statistical project).
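The programmatic approach can be sketched in a few lines. This is only an illustration, assuming the control limits are a simple lower and upper bound supplied by the data scientist; the function name `split_by_control_limits`, the sample readings, and the limits are all hypothetical.

```python
def split_by_control_limits(values, lower, upper):
    """Separate values inside [lower, upper] from those outside."""
    kept = [v for v in values if lower <= v <= upper]
    flagged = [v for v in values if not (lower <= v <= upper)]
    return kept, flagged


# Hypothetical readings; the limits are chosen for illustration only.
readings = [10.2, 9.8, 10.5, 55.0, 10.1, -3.0, 9.9]
kept, flagged = split_by_control_limits(readings, lower=5.0, upper=15.0)

print(kept)     # [10.2, 9.8, 10.5, 10.1, 9.9]
print(flagged)  # [55.0, -3.0]
```

Returning the flagged points rather than silently discarding them lets the data scientist review what was removed before committing to the cleaned dataset.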

Other data cleaning techniques include:

Validity checking: Validity checking involves applying rules to the data to determine whether it is valid. These rules can be global, as when a data scientist performs a uniqueness check because specific unique keys are part of the data pool (for example, social security numbers cannot be duplicated), or case level, as when a combination of field values must equal a certain value. The validation may be strict (such as removing records with missing values) or fuzzy (such as correcting values that partially match existing, known values).
Enhancement: This is a technique where data is made more complete by adding related information. The additional information can be calculated from the existing values within the data file, or it can be added from another source. This information could be used for reference, comparison, or contrast, or to show tendencies.
Harmonization: With data harmonization, data values are converted or mapped to other, more desirable values.
Standardization: This involves changing a reference dataset to a new standard, for example, by adopting standard codes.
Domain expertise: This involves removing or modifying data values in a data file based upon the data scientist's prior experience or best judgment.
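As a rough illustration of how some of these techniques look in code, the following sketch applies a global uniqueness check, a strict missing-value rule, and a simple harmonization mapping. The records, field names, mapping table, and helper functions are all hypothetical and chosen only to show the idea.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": "A1", "state": "ny", "age": 34},
    {"id": "B2", "state": "New York", "age": None},
    {"id": "A1", "state": "NY", "age": 51},
    {"id": "C3", "state": "calif.", "age": 29},
]


def duplicate_keys(rows, key):
    """Global validity rule: return key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def drop_missing(rows):
    """Strict validation: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Harmonization: map variant spellings to one preferred value.
STATE_MAP = {"ny": "NY", "new york": "NY", "calif.": "CA"}


def harmonize_state(rows, mapping):
    """Return copies of rows with 'state' mapped wherever a mapping exists."""
    out = []
    for row in rows:
        key = row["state"].strip().lower()
        out.append({**row, "state": mapping.get(key, row["state"])})
    return out


dupes = duplicate_keys(records, "id")
complete = drop_missing(records)
harmonized = harmonize_state(complete, STATE_MAP)

print(dupes)                              # ['A1']
print([r["state"] for r in harmonized])   # ['NY', 'NY', 'CA']
```

Values with no entry in the mapping are left unchanged here; whether to keep, flag, or remove such values is a judgment call that depends on the project.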

We will go through an example of each of these techniques in the next sections of this chapter.