SAS Made Simple: ANYCNTRL & NOTCNTRL Regex Alternatives for SAS

A control character is a character that does not represent a printable symbol but instead signals an action, such as a tab or a line feed. The SAS Help Center has a useful list of the ASCII characters and their associated decimal values that helps to demonstrate what is meant by a control character. If control characters are affecting any of your downstream systems after ingestion, then identifying where they occur should be a standard procedure. Checking for control characters in datasets where they are not expected was a standard part of our general remediation process when implementing data quality solutions.

SAS offers two standard functions to help identify these: ANYCNTRL and NOTCNTRL. ANYCNTRL searches a character string for a control character and returns the first position at which one is found. NOTCNTRL does the opposite: it returns the first position of a character that is not a control character. Both return 0 if no such character is found. A further expanded table of all ASCII values and their corresponding hex and dec fields can be found on freecodecamp.org, which is more in line with the results provided in the solution below.
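As a minimal sketch of how the two functions behave, the step below builds a short string with an embedded tab and writes the results to the log. The variable names are illustrative only, and the example assumes an ASCII-based session encoding, where '09'x is the horizontal tab.

```sas
data _null_;
    /* '09'x is the horizontal tab (dec 9), a control character */
    with_tab = 'ab' || '09'x || 'cd';
    no_cntrl = 'abcd';

    first_cntrl     = anycntrl(with_tab);  /* 3: the tab is the third character */
    first_non_cntrl = notcntrl(with_tab);  /* 1: 'a' is not a control character */
    none_found      = anycntrl(no_cntrl);  /* 0: no control character present   */

    put first_cntrl= first_non_cntrl= none_found=;
run;
```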

However, as the datasets we processed grew, ANYCNTRL became too time-consuming for our purposes, and we needed a more flexible standard approach to handling control characters that would also improve performance. One such solution was to identify them with regex instead of ANYCNTRL.

There are two regex patterns we can use for this check. The code below is a lightly modified version of the example in the SAS documentation for NOTCNTRL:

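data test;
    do dec=0 to 255;
        byte=byte(dec);             /* character for this decimal value */
        hex=put(dec, hex2.);        /* two-digit hexadecimal value */
        anycntrl=anycntrl(byte);    /* built-in function, for comparison */
        /* Method 1: explicit hex range plus the Delete character */
        anycntrl_method1 = prxmatch('/[\x00-\x1F\x7F]/', byte);
        /* Method 2: POSIX character class */
        anycntrl_method2 = prxmatch('/[[:cntrl:]]/', byte);
        output;
    end;
run;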

proc print data=test;
run;
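
To apply the same pattern to an ordinary table rather than the 256-row test, something along the following lines could be used. This is only a sketch: the dataset names work.sample and work.flagged and the variables comment, cntrl_pos and has_cntrl are hypothetical, and the sample rows are built by hand so that the step runs on its own. The pattern is compiled once with PRXPARSE and reused for every row.

```sas
/* Hypothetical sample data: the second row contains an embedded tab ('09'x) */
data work.sample;
    length comment $20;
    comment = 'clean value';             output;
    comment = 'bad' || '09'x || 'value'; output;
run;

data work.flagged;
    set work.sample;
    retain re;
    /* Compile the pattern on the first row only, then reuse the pattern id */
    if _n_ = 1 then re = prxparse('/[\x00-\x1F\x7F]/');
    cntrl_pos = prxmatch(re, comment);  /* position of first match, 0 if none */
    has_cntrl = (cntrl_pos > 0);        /* 1 when a control character is present */
    drop re;
run;

proc print data=work.flagged;
run;
```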

Both patterns achieve the same goal, and the choice of which to use comes down to the need for readability or consistency. In anycntrl_method1, the range ‘\x00’ to ‘\x1F’ represents the ASCII characters with hexadecimal values 00 to 1F, which covers the control characters from decimal 0 to 31. Still outstanding is the “Delete” character (dec 127), which is why the additional ‘\x7F’ also needs to be included in the pattern. In anycntrl_method2, [:cntrl:] is a POSIX character class that specifically represents control characters. Note that anycntrl_method2, which uses the [[:cntrl:]] regex argument, will also match the dec 255, hex FF value, which is ÿ, or “Latin small letter y with diaeresis”. This will need to be considered if using this pattern for identification.
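
One way to see exactly where the three approaches disagree is to reduce each result to a 0/1 flag and keep only the rows where the flags differ. The sketch below assumes the test dataset created by the DATA step above; the dataset name mismatches and the flag variables are hypothetical names introduced for illustration.

```sas
data mismatches;
    set test;
    /* Reduce each result to a 0/1 flag so the methods can be compared directly */
    flag_fn = (anycntrl > 0);
    flag_m1 = (anycntrl_method1 > 0);
    flag_m2 = (anycntrl_method2 > 0);
    /* Keep only the rows where a regex method disagrees with ANYCNTRL */
    if flag_fn ne flag_m1 or flag_fn ne flag_m2;
run;

proc print data=mismatches;
    var dec hex flag_fn flag_m1 flag_m2;
run;
```

Any rows printed here are the values to review before choosing one pattern over the other. The result may vary with the session encoding.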

Hopefully this gives you a starting point for using regex to identify control characters within your data in SAS. For more SAS tips, have a look through the rest of our ‘SAS Made Simple’ series.
