ときどきの雑記帖 RE* (新南口)

Proving Grounds of the Mad Overlord

October 12, 2022

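Pears

Akizuki
Nansui

Sugaku Seminar (数学セミナー)

Today should have been the release date, but the bookstore I went to had only one copy left. Was there something special this month?

Normally they have around ten copies.

Disinfectant alcohol

It is gradually being taken away, I see.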

awk

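Earlier I wrote

"I'll write up the 'reveal (solution)' if I feel like it (if I have nothing else to write about) 😄"

and this is the continuation 😄

On a whim I went to look at the GitHub repository of the awk-like implementation that guy is writing, and…

kat0h/rusty_awk: work in progress

An implementation of AWK as specified by POSIX. It refers to the 2017 edition of POSIX, but differs from the specification in some respects.

When a number and a string are compared, the number is converted to a string and then compared.

The POSIX specification is wrong. This is the same behavior as nawk/gawk.

Ouch, so that is how he read it > "The POSIX specification is wrong"

gawk's behavior here should be the same as POSIX, because gawk aligned itself with POSIX (I think it changed around 2.15.6 (needs checking)).

I think this "misunderstanding" ultimately comes from misreading the "string value that is a numeric string" part, but he has locked his Twitter account (not that I would send him a message even if he hadn't).

Well, never mind.

I want to write up where exactly it is "wrong", but before looking at the opengroup page I'll explain it using the gawk reference manual (I think that one is easier to follow).

As a bonus, I'll add my own rough rendering of each passage 😄

The part of the reference manual that explains this behavior is here: 6.3.2.1 String Type versus Numeric Type - The GNU Awk User’s Guide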

6.3.2.1 String Type versus Numeric Type

Scalar objects in ‘awk’ (variables, array elements, and fields) are dynamically typed. This means their type can change as the program runs, from “untyped” before any use,(1) to string or number, and then from string to number or number to string, as the program progresses. (‘gawk’ also provides regexp-typed scalars, but let’s ignore that for now; *note Strong Regexp Constants::.)

Translation: Scalar objects in awk (variables, array elements, and fields) are dynamically typed. In other words, a scalar's type can change at run time as the program progresses: from "untyped" before it has been used at all (note 1), to number or string, and then from string to number or from number to string. (gawk also has regexp-typed scalars, but we ignore them here; for those, see Strong Regexp Constants.)

You can’t do much with untyped variables, other than tell that they are untyped. The following program tests ‘a’ against ‘""’ and ‘0’; the test succeeds when ‘a’ has never been assigned a value. It also uses the built-in ‘typeof()’ function (not presented yet; *note Type Functions::) to show ‘a’’s type:

Translation: There is not much you can do with an untyped variable other than determine that it is untyped. The program below compares ‘a’ against ‘""’ and ‘0’; the test succeeds when ‘a’ has never been assigned a value. The program also uses the built-in ‘typeof()’ function (not explained yet; see Type Functions) to show the type of ‘a’.

$ gawk 'BEGIN { print (a == "" && a == 0 ?
> "a is untyped" : "a has a type!") ; print typeof(a) }'
-| a is untyped
-| unassigned

A scalar has numeric type when assigned a numeric value, such as from a numeric constant, or from another scalar with numeric type:

Translation: A scalar has numeric type when it is assigned a numeric value, such as a numeric constant or another scalar that has numeric type.

$ gawk 'BEGIN { a = 42 ; print typeof(a)
> b = a ; print typeof(b) }'
-| number
-| number

Similarly, a scalar has string type when assigned a string value, such as from a string constant, or from another scalar with string type:

Translation: Similarly, a scalar has string type when it is assigned a string value, such as a string constant or another scalar that has string type.

$ gawk 'BEGIN { a = "forty two" ; print typeof(a)
> b = a ; print typeof(b) }'
-| string
-| string

So far, this is all simple and straightforward. What happens, though, when ‘awk’ has to process data from a user? Let’s start with field data. What should the following command produce as output?

Translation: Everything so far has been simple and straightforward. But what happens when awk has to process data from a user? Let's start with field data. What should the following command print?

echo hello | awk '{ printf("%s %s < 42\n", $1,
                          ($1 < 42 ? "is" : "is not")) }'

Since ‘hello’ is alphabetic data, ‘awk’ can only do a string comparison. Internally, it converts ‘42’ into ‘“42”’ and compares the two string values ‘“hello”’ and ‘“42”’. Here’s the result:

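Translation: Since ‘hello’ is alphabetic data, awk can only do a string comparison. Internally it converts the number ‘42’ to the string ‘"42"’ and then compares the two strings ‘"hello"’ and ‘"42"’. The result is: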

$ echo hello | awk '{ printf("%s %s < 42\n", $1,
>                            ($1 < 42 ? "is" : "is not")) }'
-| hello is not < 42

However, what happens when data from a user looks like a number? On the one hand, in reality, the input data consists of characters, not binary numeric values. But, on the other hand, the data looks numeric, and ‘awk’ really ought to treat it as such. And indeed, it does:

Translation: Now, what happens when data entered by a user looks like a number? The input consists of characters rather than binary numeric values, but when that data looks numeric, awk treats it as a number. And indeed, it does:

$ echo 37 | awk '{ printf("%s %s < 42\n", $1,
>                         ($1 < 42 ? "is" : "is not")) }'
-| 37 is < 42

Here are the rules for when ‘awk’ treats data as a number, and for when it treats data as a string.

Translation: There are rules that determine when awk treats data as a number and when it treats data as a string.

The POSIX standard uses the term “numeric string” for input data that looks numeric. The ‘37’ in the previous example is a numeric string. So what is the type of a numeric string? Answer: numeric.

Translation: The POSIX standard uses the term "numeric string" for input data that looks numeric. The ‘37’ in the previous example is a numeric string. So what is the type of a numeric string? The answer: numeric.

The type of a variable is important because the types of two variables determine how they are compared. Variable typing follows these definitions and rules:

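Translation: The type of a variable matters because the types of the two variables determine how they are compared. Variable typing follows the definitions and rules below: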

A numeric constant or the result of a numeric operation has the “numeric” attribute.

A string constant or the result of a string operation has the “string” attribute.

Fields, ‘getline’ input, ‘FILENAME’, ‘ARGV’ elements, ‘ENVIRON’ elements, and the elements of an array created by ‘match()’, ‘split()’, and ‘patsplit()’ that are numeric strings have the “strnum” attribute.(2) Otherwise, they have the “string” attribute. Uninitialized variables also have the “strnum” attribute.

Attributes propagate across assignments but are not changed by any use.

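Translation:
A numeric constant, or the result of a numeric operation, has the "numeric" attribute.
A string constant, or the result of a string operation, has the "string" attribute.
Fields, getline input, ‘FILENAME’, ‘ARGV’ elements, ‘ENVIRON’ elements, and the elements of arrays created by ‘match()’, ‘split()’, and ‘patsplit()’ have the "strnum" attribute if they are numeric strings (note 2). Otherwise they have the "string" attribute. Uninitialized variables also have the "strnum" attribute.
Attributes propagate through assignment, but they are not changed by use.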

The last rule is particularly important. In the following program, ‘a’ has numeric type, even though it is later used in a string operation:

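Translation: The last rule is especially important. In the following program, ‘a’ has numeric type, and that does not change even after it is used in a string operation later in the program.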

BEGIN {
    a = 12.345
    b = a " is a cute number"
    print b
}

When two operands are compared, either string comparison or numeric comparison may be used. This depends upon the attributes of the operands, according to the following symmetric matrix:

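Translation: When two operands are compared, either a string comparison or a numeric comparison is performed. Which one is chosen depends on the attributes of the two operands, according to the symmetric matrix below: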

       +----------------------------------------------
       |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
       |
STRING  |       string          string          string
       |
NUMERIC |       string          numeric         numeric
       |
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------

The basic idea is that user input that looks numeric–and only user input–should be treated as numeric, even though it is actually made of characters and is therefore also a string. Thus, for example, the string constant ‘" +3.14"’, when it appears in program source code, is a string–even though it looks numeric–and is never treated as a number for comparison purposes.

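Translation: The basic idea is this: user input that looks numeric should be treated as numeric, even though it is really a string made of characters, and only when it is user input. So, for example, when the string constant ‘" +3.14"’ appears in program source code, it is a string even though it looks numeric, and it is never treated as a number in a comparison.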

In short, when one operand is a “pure” string, such as a string constant, then a string comparison is performed. Otherwise, a numeric comparison is performed. (The primary difference between a number and a strnum is that for strnums ‘gawk’ preserves the original string value that the scalar had when it came in.)

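Translation: In short, when either operand is a "pure" string such as a string constant, a string comparison is performed; otherwise a numeric comparison is performed. (The primary difference between a number and a strnum is that for a strnum gawk keeps the original string value the scalar had when it came in.)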
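The matrix can be spot-checked with a few one-liners. This is only a small sketch, assuming a POSIX-style awk (gawk, mawk, onetrue-awk and the like) is installed as `awk`; the function name `matrix_demo` is made up for this note and is not from the manual.

```sh
#!/bin/sh
# Hypothetical helper (not from the gawk manual): prints one 0/1 result per line.
matrix_demo() {
    # STRNUM field vs NUMERIC constant: numeric comparison, so 10 < 9 is false
    echo 10 | awk '{ print ($1 < 9) }'
    # STRING variable (set from a string constant) vs NUMERIC constant:
    # string comparison, so "10" < "9" is true
    awk 'BEGIN { s = "10"; print (s < 9) }'
    # The strnum attribute propagates through assignment: still numeric
    echo 10 | awk '{ x = $1; print (x < 9) }'
}
matrix_demo
```

The same text "10" gives opposite answers depending on where it came from, which is exactly the point of the STRING and STRNUM rows.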

This point bears additional emphasis: Input that looks numeric is numeric. All other input is treated as strings.

Translation: This point bears additional emphasis: input that looks numeric is numeric. All other input is treated as a string.

Thus, the six-character input string ‘ +3.14’ receives the strnum attribute. In contrast, the eight characters ‘" +3.14"’ appearing in program text comprise a string constant. The following examples print ‘1’ when the comparison between the two different constants is true, and ‘0’ otherwise:

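Translation: Thus the six-character input string ‘ +3.14’ gets the strnum attribute. By contrast, the eight characters ‘" +3.14"’ appearing in the program text form a string constant. The examples below print 1 when the comparison between the two different constants is true, and 0 otherwise: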

$ echo ' +3.14' | awk '{ print($0 == " +3.14") }'    True
-| 1
$ echo ' +3.14' | awk '{ print($0 == "+3.14") }'     False
-| 0
$ echo ' +3.14' | awk '{ print($0 == "3.14") }'      False
-| 0
$ echo ' +3.14' | awk '{ print($0 == 3.14) }'        True
-| 1
$ echo ' +3.14' | awk '{ print($1 == " +3.14") }'    False
-| 0
$ echo ' +3.14' | awk '{ print($1 == "+3.14") }'     True
-| 1
$ echo ' +3.14' | awk '{ print($1 == "3.14") }'      False
-| 0
$ echo ' +3.14' | awk '{ print($1 == 3.14) }'        True
-| 1

You can see the type of an input field (or other user input) using ‘typeof()’:

Translation: You can check the type of an input field (or other user input) with ‘typeof()’.

$ echo hello 37 | gawk '{ print typeof($1), typeof($2) }'
-| string strnum

---------- Footnotes ----------

(1) 'gawk' calls this "unassigned", as the following example shows.

(2) Thus, a POSIX numeric string and 'gawk''s strnum are the same thing.

Footnotes (translation)

(1) gawk calls this "unassigned", as the example that follows shows.

(2) Thus POSIX's numeric string and gawk's strnum are the same thing.

Now, going back to the opengroup awk page, here is the passage in question:

Comparisons (with the ‘<’, “<=”, “!=”, “==”, ‘>’, and “>=” operators) shall be made numerically if both operands are numeric, if one is numeric and the other has a string value that is a numeric string, or if one is numeric and the other has the uninitialized value. Otherwise, operands shall be converted to strings as required and a string comparison shall be made as follows:

Earlier on the same page, we find this:

A string value shall be considered a numeric string if it comes from one of the following:

Field variables

Input from the getline() function

FILENAME

ARGV array elements

ENVIRON array elements

Array elements created by the split() function

A command line variable assignment

Variable assignment from another numeric string variable

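and an implementation-dependent condition corresponding to either case (a) or (b) below is met. (rest omitted)

So that is what it comes down to.

Hm, I don't recall anything in the gawk reference manual that corresponds to "A command line variable assignment". (The program itself seems to handle it properly. An omission in the reference manual?)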
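The "A command line variable assignment" item is easy to try out. A minimal sketch, assuming an awk that follows the POSIX list above; `clvar_demo` is a name invented here.

```sh
#!/bin/sh
# Hypothetical helper: compares a variable holding the text 10 with the constant 9.
clvar_demo() {
    # Assigned on the command line: x is a numeric string, so 10 < 9 is done numerically
    awk -v x=10 'BEGIN { print (x < 9) }'
    # Assigned from a string constant inside the program: plain string, "10" < "9"
    awk 'BEGIN { x = "10"; print (x < 9) }'
}
clvar_demo
```

On an awk that implements the rule, the first line should come out as 0 and the second as 1.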

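Bonus

"The passage in question seems to have been fixed in Issue 8 of the upcoming POSIX. (I haven't read it carefully, so I may be wrong.) https://t.co/aWBWYSmGW4"
- Koichi Nakashima (@ko1nksm) October 1, 2022

There was a tweet like the one above, but when you actually look at what it points to: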

0001198: Comparison of numeric string values in awk - Austin Group Defect Tracker

Make the following changes:

On page 2485 line 79876 section awk change:

　　$expr | Field reference | String | N/A

to:

　　$expr | Field reference | Uninitialized or string | N/A

On page 2489 line 80031-80033 section awk change:
　　Comparisons (with the ‘<’, “<=”, “!=”, “==”, ‘>’, and “>=” operators) shall be made
　　numerically if both operands are numeric, if one is numeric and the other has a
　　string value that is a numeric string, or if one is numeric and the other has the
　　uninitialized value. Otherwise…
to:
　　Comparisons (with the ‘<’, “<=”, “!=”, “==”, ‘>’, and “>=” operators) shall be made numerically:

　　　if both operands are numeric,

　　　if one is numeric and the other has a string value that is a numeric string,

　　　if both have string values that are numeric strings, or

　　　if one is numeric and the other has the uninitialized value.

Otherwise…

So the comparison procedure and its substance have not changed (the handling of uninitialized variables has at least become clearer).

See: around "Expressions in awk".
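The newly listed case, "if both have string values that are numeric strings", is what awk implementations have long done with two fields. A small sketch under the same assumption as before (a POSIX-style `awk`); `both_strnum_demo` is a made-up name.

```sh
#!/bin/sh
# Hypothetical helper: compares two fields that both hold numeric strings.
both_strnum_demo() {
    # Both operands are numeric strings: compared as numbers, so 10 < 9 is false
    echo '10 9' | awk '{ print ($1 < $2) }'
    # Concatenating "" makes each operand a plain string: "10" < "9" is true
    echo '10 9' | awk '{ print (($1 "") < ($2 "")) }'
}
both_strnum_demo
```

The second line is there only for contrast; it shows what a purely string-based reading of the comparison would produce.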

gawk

(I think it changed around 2.15.6 (needs checking)).

This is a follow-up to that remark. As usual, I searched NEWS (actually NEWS.0) and found it.

Changes from 2.12.20 to 2.12.21
-------------------------------

Corrected missing/gcvt.c.

Got rid of use of dup2() and thus DUP_MISSING.

Updated config/sgi33.

Turned on (and fixed) in cmp_nodes() the behaviour that I *hope* will be in
  POSIX 1003.2 for relational comparisons.

Small updates to test suite.

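This is it. If my hazy memory serves, there was no corresponding jgawk for this range (2.12 through 2.14, after 2.11), so "changed around 2.15.6" was not too far off.

While I was at it, I looked over the relevant parts of older source code. First, 1.0.1, the oldest version I could find.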

awk2.c

cmp_nodes(t1,t2)
NODE *t1,*t2;
{
  register int	di;
  register AWKNUM d;


  if(t1==t2) {
    return 0;
  }
#ifndef FAST
  if(!t1 || !t2) {
    abort();
    return t1 ? 1 : -1;
  }

#endif
  if (t1->type == Node_number && t2->type == Node_number) {
    d = t1->numbr - t2->numbr;
    if (d < 0.0)
      return -1;
    if (d > 0.0)
      return 1;
    return 0;
  }
  t1=force_string(t1);
  t2=force_string(t2);
  /* "real" awk treats things as numbers if they both "look" like numbers. */
  if (*t1->stptr && *t2->stptr	/* don't allow both to be empty strings(jfw)*/
  &&  is_a_number(t1->stptr) && is_a_number(t2->stptr)) {
	double atof();
	d = atof(t1->stptr) - atof(t2->stptr);
	if (d < 0.0) return -1;
	if (d > 0.0) return 1;
	return 0;
  }
  di = strncmp (t1->stptr, t2->stptr, min (t1->stlen, t2->stlen));
  if (di == 0)
    di = t1->stlen - t2->stlen;
  if(di>0) return 1;
  if(di<0) return -1;
  return 0;
}

The comment "real" awk treats things as numbers if they both "look" like numbers. catches my attention. And is_a_number(), which is called right after that comment, looks like this.

/* FOO this doesn't properly compare "12.0" and 12.0 etc */
/* or "1E1" and 10 etc */
/* Perhaps someone should fix it.  */
/* Consider it fixed (jfw) */

/* strtod() would have been better, except (1) real awk is needlessly
 * restrictive in what strings it will consider to be numbers, and
 * (2) I couldn't find the public domain version anywhere handy.
 */
is_a_number(str)	/* does the string str have pure-numeric syntax? */
char *str;		/* don't convert it, assume that atof is better */
{
	if (*str == 0) return 1; /* null string has numeric value of0 */
		/* This is still a bug: in real awk, an explicit "" string
		 * is not treated as a number.  Perhaps it is only variables
		 * that, when empty, are also 0s.  This bug-lette here at
		 * least lets uninitialized variables to compare equal to
		 * zero like they should.
		 */
	if (*str == '-') str++;
	if (*str == 0) return 0;
	/* must be either . or digits (.4 is legal) */
	if (*str != '.' && !isdigit(*str)) return 0;
	while (isdigit(*str)) str++;
	if (*str == '.') {
		str++;
		while (isdigit(*str)) str++;
	}
	/* curiously, real awk DOESN'T consider "1E1" to be equal to 10!
	 * Or even equal to 1E1 for that matter!  For a laugh, try:
	 * awk 'BEGIN {if ("1E1" == 1E1) print "eq"; else print "neq";exit}'
	 * Since this behavior is QUITE curious, I include the code for the
	 * adventurous.  One might also feel like skipping leading whitespace
	 * (awk doesn't) and allowing a leading + (awk doesn't).
#ifdef Allow_Exponents
	if (*str == 'e' || *str == 'E') {
		str++;
		if (*str == '+' || *str == '-') str++;
		if (!isdigit(*str)) return 0;
		while (isdigit(*str)) str++;
	}
#endif
	/* if we have digested the whole string, we are successful */
	return (*str == 0);
}

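Hmm. There is some wavering over what to do with strings in exponent notation (that look like numbers). That aside, the code decides at comparison time whether each operand "looks like a number" and, if needed, converts before comparing ("real" awk treats things as numbers if they both "look" like numbers.). So, unlike today's awk, a variable set in the program such as s = "2.718" would also be compared numerically (depending on the other operand).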
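To make the 1.0.1 rule concrete, here is a rough model of just the decision step, written in awk rather than C. It is my own sketch, not the original code: `old_style_cmp` and `looks_numeric` are hypothetical names, the regular expression is my reading of is_a_number() with Allow_Exponents left undefined, and the helper only reports which kind of comparison would be chosen.

```sh
#!/bin/sh
# Hypothetical model of the gawk 1.0.1 decision in cmp_nodes(); not the original code.
old_style_cmp() {
    awk -v a="$1" -v b="$2" '
    # Optional "-", then digits with an optional fraction, or a "." followed by digits.
    # No leading "+", no whitespace, no exponent.
    function looks_numeric(s) {
        return s ~ /^-?([0-9]+(\.[0-9]*)?|\.[0-9]*)$/
    }
    BEGIN {
        # cmp_nodes() also requires both strings to be non-empty
        if (length(a) > 0 && length(b) > 0 && looks_numeric(a) && looks_numeric(b))
            print "numeric"
        else
            print "string"
    }'
}
old_style_cmp 2.718 10   # both look numeric
old_style_cmp 1E1 10     # exponent form is rejected
old_style_cmp +3 3       # leading "+" is rejected
```

Because the check looks only at the text of each operand, where the value came from (field, constant, variable) plays no role, which is the difference from the attribute-based rule described in the manual.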

Next, 2.11.

eval.c

int
cmp_nodes(t1, t2)
NODE *t1, *t2;
{
	AWKNUM d;
	AWKNUM d1;
	AWKNUM d2;
	int ret;
	int len1, len2;

	if (t1 == t2)
		return 0;
	d1 = force_number(t1);
	d2 = force_number(t2);
	if ((t1->flags & NUMERIC) && (t2->flags & NUMERIC)) {
		d = d1 - d2;
		if (d == 0.0)	/* from profiling, this is most common */
			return 0;
		if (d > 0.0)
			return 1;
		return -1;
	}
	t1 = force_string(t1);
	t2 = force_string(t2);
	len1 = t1->stlen;
	len2 = t2->stlen;
	if (len1 == 0) {
		if (len2 == 0)
			return 0;
		else
			return -1;
	} else if (len2 == 0)
		return 1;
	ret = memcmp(t1->stptr, t2->stptr, len1 <= len2 ? len1 : len2);
	if (ret == 0 && len1 != len2)
		return len1 < len2 ? -1: 1;
	return ret;
}

Bit flags such as NUMERIC have been introduced. Also, the is_a_number() → atof() call (the forced conversion to a number) that existed in 1.0.1 is gone.

A quick check of where NUMERIC is used gives:

builtin.c:296:			if (arg->flags & NUMERIC) {
builtin.c:1038:		t->flags &= ~(NUM|NUMERIC);
eval.c:680:	n->flags |= (NUM|NUMERIC);
eval.c:728:	if (t1->flags & NUMERIC)
eval.c:750:	if ((t1->flags & NUMERIC) && (t2->flags & NUMERIC)) {
main.c:142:	Nnull_string->flags = (PERM|STR|NUM|NUMERIC);
node.c:57:			n->flags |= NUMERIC;
node.c:64:			n->flags |= NUMERIC;
node.c:173:	r->flags |= (NUM|NUMERIC);

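It is too much trouble, so I won't go into the details of where the bit flags are used.

I looked for anything written about this change, and there was something.

From the CHANGES file of 2.11: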

Changes from 2.10beta to 2.11beta
(omitted)
Added new node value flag NUMERIC to indicate that a variable is purely a number as opposed to type NUM which indicates that the node’s numeric value is valid. This is set in make_number(), tmp_number and r_force_number() when appropriate and used in cmp_nodes(). This fixed a bug in comparison of variables that had numeric prefixes. The new code uses strtod() and eliminates is_a_number(). A simple strtod() is provided for systems lacking one. It does no overflow checking, so could be improved.

If this note is to be believed (and if my memory is correct), make_number() and friends should not be called for an assignment like s = "1.4142", so the behavior changed slightly at this point.

And then 2.15.6.

eval.c

/*
 * compare two nodes, returning negative, 0, positive
 */
int
cmp_nodes(t1, t2)
register NODE *t1, *t2;
{
	register int ret;
	register size_t len1, len2;

	if (t1 == t2)
		return 0;
	if (t1->flags & MAYBE_NUM)
		(void) force_number(t1);
	if (t2->flags & MAYBE_NUM)
		(void) force_number(t2);
	if ((t1->flags & NUMBER) && (t2->flags & NUMBER)) {
		if (t1->numbr == t2->numbr) return 0;
		else if (t1->numbr - t2->numbr < 0)  return -1;
		else return 1;
	}
	(void) force_string(t1);
	(void) force_string(t2);
	len1 = t1->stlen;
	len2 = t2->stlen;
	if (len1 == 0 || len2 == 0)
		return len1 - len2;
	ret = memcmp(t1->stptr, t2->stptr, len1 <= len2 ? len1 : len2);
	return ret == 0 ? len1-len2 : ret;
}

The bit flags have been renamed, and the places that set them have changed (there are more of them).

eval.c:744:	if (t1->flags & MAYBE_NUM)
eval.c:766:	if (t1->flags & MAYBE_NUM)
eval.c:768:	if (t2->flags & MAYBE_NUM)
field.c:109:	n->flags = (PERM|STR|STRING|MAYBE_NUM);
field.c:190:		nodes[0]->flags = (STRING|STR|PERM|MAYBE_NUM);
field.c:193:	fields_arr[0]->flags |= MAYBE_NUM;
field.c:521:	it->flags |= MAYBE_NUM;
io.c:1170:			(*lhs)->flags |= MAYBE_NUM;
main.c:494:	(*aptr)->flags |= MAYBE_NUM;
main.c:498:		(*aptr)->flags |= MAYBE_NUM;
main.c:580:		(*aptr)->flags |= MAYBE_NUM;
main.c:621:		it->flags |= MAYBE_NUM;
node.c:69:	if (n->flags & MAYBE_NUM) {
node.c:71:		n->flags &= ~MAYBE_NUM;

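Too much trouble again, so I'll skip the details. This should be where the behavior was brought in line with POSIX.

Finally, 5.2.0, the latest version at the moment.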

eval.c

/* cmp_nodes --- compare two nodes, returning negative, 0, positive */

int
cmp_nodes(NODE *t1, NODE *t2, bool use_strcmp)
{
	int ret = 0;
	size_t len1, len2;
	int l, ldiff;

	if (t1 == t2)
		return 0;

	(void) fixtype(t1);
	(void) fixtype(t2);

	if ((t1->flags & NUMBER) != 0 && (t2->flags & NUMBER) != 0)
		return cmp_numbers(t1, t2);

	(void) force_string(t1);
	(void) force_string(t2);
	len1 = t1->stlen;
	len2 = t2->stlen;
	ldiff = len1 - len2;
	if (len1 == 0 || len2 == 0)
		return ldiff;

	if (do_posix && ! use_strcmp)
		return posix_compare(t1, t2);

	l = (ldiff <= 0 ? len1 : len2);
	if (IGNORECASE) {
		const unsigned char *cp1 = (const unsigned char *) t1->stptr;
		const unsigned char *cp2 = (const unsigned char *) t2->stptr;
		char save1 = t1->stptr[t1->stlen];
		char save2 = t2->stptr[t2->stlen];


		if (gawk_mb_cur_max > 1) {
			t1->stptr[t1->stlen] = t2->stptr[t2->stlen] = '\0';
			ret = strncasecmpmbs((const unsigned char *) cp1,
					     (const unsigned char *) cp2, l);
			t1->stptr[t1->stlen] = save1;
			t2->stptr[t2->stlen] = save2;
		} else {
			/* Could use tolower() here; see discussion above. */
			for (ret = 0; l-- > 0 && ret == 0; cp1++, cp2++)
				ret = casetable[*cp1] - casetable[*cp2];
		}
	} else
		ret = memcmp(t1->stptr, t2->stptr, l);

	ret = ret == 0 ? ldiff : ret;
	return ret;
}

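It has grown quite "hairy". You can see support for IGNORECASE (and for multibyte characters). But I thought IGNORECASE itself had been around since the 2.x days? 🤔

Some readers may be curious about cmp_numbers and posix_compare, but posix_compare is the function shown below, and cmp_numbers is a function pointer used to switch implementations depending on whether MPFR is in use, so neither matters much for this discussion.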

/* posix_compare --- compare strings using strcoll */

static int
posix_compare(NODE *s1, NODE *s2)
{
	int ret;

	if (gawk_mb_cur_max == 1) {
		char save1, save2;
		const char *p1, *p2;

		save1 = s1->stptr[s1->stlen];
		s1->stptr[s1->stlen] = '\0';

		save2 = s2->stptr[s2->stlen];
		s2->stptr[s2->stlen] = '\0';

		p1 = s1->stptr;
		p2 = s2->stptr;

		for (;;) {
			size_t len;

			ret = strcoll(p1, p2);
			if (ret != 0)
				break;

			len = strlen(p1);
			p1 += len + 1;
			p2 += len + 1;

			if (p1 == s1->stptr + s1->stlen + 1) {
				if (p2 != s2->stptr + s2->stlen + 1)
					ret = -1;
				break;
			}
			if (p2 == s2->stptr + s2->stlen + 1) {
				ret = 1;
				break;
			}
		}

		s1->stptr[s1->stlen] = save1;
		s2->stptr[s2->stlen] = save2;
	}
	else {
		/* Similar logic, using wide characters */
		const wchar_t *p1, *p2;

		(void) force_wstring(s1);
		(void) force_wstring(s2);

		p1 = s1->wstptr;
		p2 = s2->wstptr;

		for (;;) {
			size_t len;

			ret = wcscoll(p1, p2);
			if (ret != 0)
				break;

			len = wcslen(p1);
			p1 += len + 1;
			p2 += len + 1;

			if (p1 == s1->wstptr + s1->wstlen + 1) {
				if (p2 != s2->wstptr + s2->wstlen + 1)
					ret = -1;
				break;
			}
			if (p2 == s2->wstptr + s2->wstlen + 1) {
				ret = 1;
				break;
			}
		}
	}

	return ret;
}

onetrueawk

After gawk, let's look at onetrueawk too. The comparison is done here.

awk/run.c at master · onetrueawk/awk

Cell *relop(Node **a, int n)	/* a[0 < a[1], etc. */
{
	int i;
	Cell *x, *y;
	Awkfloat j;

	x = execute(a[0]);
	y = execute(a[1]);
	if (x->tval&NUM && y->tval&NUM) {
		j = x->fval - y->fval;
		i = j<0? -1: (j>0? 1: 0);
	} else {
		i = strcmp(getsval(x), getsval(y));
	}
	tempfree(x);
	tempfree(y);
	switch (n) {
	case LT:	if (i<0) return(True);
			else return(False);
	case LE:	if (i<=0) return(True);
			else return(False);
	case NE:	if (i!=0) return(True);
			else return(False);
	case EQ:	if (i == 0) return(True);
			else return(False);
	case GE:	if (i>=0) return(True);
			else return(False);
	case GT:	if (i>0) return(True);
			else return(False);
	default:	/* can't happen */
		FATAL("unknown relational operator %d", n);
	}
	return 0;	/*NOTREACHED*/
}

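In a word, if (x->tval&NUM && y->tval&NUM) { makes the decision: a numeric comparison is done when both operands of the comparison operator have the NUM bit flag set. Looking for the places that set that bit flag, we find: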

awk/lib.c

if (is_number(fldtab[0]->sval, & result)) {
	fldtab[0]->fval = result;
	fldtab[0]->tval |= NUM;
}

awk/lib.c

if (is_number(q->sval, & result)) {
	q->fval = result;
	q->tval |= NUM;
}

awk/lib.c

for (j = 1; j <= lastfld; j++) {
	double result;

	p = fldtab[j];
	if(is_number(p->sval, & result)) {
		p->fval = result;
		p->tval |= NUM;
	}
}

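Like this.

The first is the initialization of the record ($0), the second is variable assignment, and the third is the initialization of the fields ($1, $2, …, $NF).

Well, I wonder about the handling of ENVIRON and ARGV elements, but otherwise it is "per the spec".

Addendum:
ARGV and ENVIRON were handled in arginit (awk/tran.c) and envinit (awk/tran.c) respectively (the NUM bit is set there).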
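The ENVIRON and ARGV cases can be checked from the outside without reading the source. A sketch only: `env_argv_demo` and the variable `DEMO_N` are names I made up, and the expected behavior assumes an awk that marks these elements as numeric strings.

```sh
#!/bin/sh
# Hypothetical helper: ENVIRON and ARGV elements that hold the text 10, compared with 9.
env_argv_demo() {
    # ENVIRON element: numeric string, so 10 < 9 is evaluated numerically
    DEMO_N=10 awk 'BEGIN { print (ENVIRON["DEMO_N"] < 9) }'
    # ARGV element: same; a BEGIN-only program never tries to open "10" as a file
    awk 'BEGIN { print (ARGV[1] < 9) }' 10
}
env_argv_demo
```

Both lines should print 0 when the elements carry the numeric-string marking; a 1 would mean the element was compared as a plain string.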

By the way, unrelated to NUM, here is a comment that caught my eye.

awk/lex.c

case '"':
return string();        /* BUG: should be like tran.c ? */

What were they trying to do? > "should be like tran.c"

v7

While I'm at it, v7's as well.

v7unix/run.c

obj relop(a,n) node **a;
{
	int i;
	obj x, y;
	awkfloat j;

	x = execute(a[0]);
	y = execute(a[1]);
	if (x.optr->tval&NUM && y.optr->tval&NUM) {
		j = x.optr->fval - y.optr->fval;
		i = j<0? -1: (j>0? 1: 0);
	} else {
		i = strcmp(getsval(x.optr), getsval(y.optr));
	}
	tempfree(x);
	tempfree(y);
	switch (n) {
	default:
		error(FATAL, "unknown relational operator %d", n);
	case LT:	if (i<0) return(true);
			else return(false);
	case LE:	if (i<=0) return(true);
			else return(false);
	case NE:	if (i!=0) return(true);
			else return(false);
	case EQ:	if (i==0) return(true);
			else return(false);
	case GE:	if (i>=0) return(true);
			else return(false);
	case GT:	if (i>0) return(true);
			else return(false);
	}
}

The comparison itself is not much different from today's. So, checking where NUM is used:

>grep -w -e NUM *.c awk*
lib.c:                  fldtab[i].tval |= NUM;
run.c:static cell nullval ={0,0,0.0,NUM,0};
run.c:          recloc->tval &= ~NUM;
run.c:          nrloc->tval |= NUM;
run.c:  if (x.optr->tval&NUM && y.optr->tval&NUM) {
run.c:          if (y.optr->tval&NUM) setfval(x.optr, y.optr->fval);
run.c:  x.optr->tval = NUM;
tran.c: setsymtab("0", tostring("0"), 0.0, NUM|STR|CON|FLD, symtab);
tran.c: nfloc = setsymtab("NF", NULL, 0.0, NUM, symtab);
tran.c: nrloc = setsymtab("NR", NULL, 0.0, NUM, symtab);
tran.c: vp->tval |= NUM;        /* mark number ok */
tran.c: vp->tval &= ~NUM;
tran.c: if ((vp->tval & NUM) == 0) {
tran.c:                         vp->tval |= NUM;
tran.c: if ((vp->tval & (NUM | STR)) == 0)
awk.def:#define NUM     02      /* number value is valid */
awk.lx.l:<A>NF          { mustfld=1; yylval = setsymtab(yytext, NULL, 0.0, NUM, symtab); RETURN(VAR); }
awk.lx.l:               yylval = setsymtab(yytext, NULL, atof(yytext), CON|NUM, symtab); RETURN(NUMBER); }

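The first line, fldtab[i].tval |= NUM;, is presumably applied to fields, judging by the variable name. Besides that, there is one place in run.c and three in tran.c where the bit flag is set with |; leaving out the ones that set it for purely numeric operations, we get: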

run.c

obj program(a, n) node **a;
{
	obj x;

	if (a[0] != NULL) {
		x = execute(a[0]);
		if (isexit(x))
			return(true);
		if (isjump(x))
			error(FATAL, "unexpected break, continue or next");
		tempfree(x);
	}
	while (getrec()) {
		recloc->tval &= ~NUM;
		recloc->tval |= STR;
		++nrloc->fval;
		nrloc->tval &= ~STR;
		nrloc->tval |= NUM;
		x = execute(a[1]);
		if (isexit(x)) break;
		tempfree(x);
	}
	tempfree(x);
	if (a[2] != NULL) {
		x = execute(a[2]);
		if (isbreak(x) || isnext(x) || iscont(x))
			error(FATAL, "unexpected break, continue or next");
		tempfree(x);
	}
	return(true);
}

tran.c

awkfloat getfval(vp)
register cell *vp;
{
	awkfloat atof();

	if (vp->sval == record && donerec == 0)
		recbld();
	dprintf("getfval: %o", vp, NULL, NULL);
	checkval(vp);
	if ((vp->tval & NUM) == 0) {
		/* the problem is to make non-numeric things */
		/* have unlikely numeric variables, so that */
		/* $1 == $2 comparisons sort of make sense when */
		/* one or the other is numeric */
		if (isnumber(vp->sval)) {
			vp->fval = atof(vp->sval);
			if (!(vp->tval & CON))	/* don't change type of a constant */
				vp->tval |= NUM;
		}
		else
			vp->fval = 0.0;	/* not a very good idea */
	}
	dprintf("  %g\n", vp->fval, NULL, NULL);
	return(vp->fval);
}

lib.c

fldbld()
{
	register char *r, *fr, sep;
	int i, j;

	r = record;
	fr = fields;
	if ((sep = **FS) == ' ')
		for (i = 0; ; ) {
			while (*r == ' ' || *r == '\t' || *r == '\n')
				r++;
			if (*r == 0)
				break;
			i++;
			if (i >= MAXFLD)
				error(FATAL, "record `%.20s...' has too many fields", record);
			if (!(fldtab[i].tval&FLD))
				xfree(fldtab[i].sval);
			fldtab[i].sval = fr;
			fldtab[i].tval = FLD | STR;
			do
				*fr++ = *r++;
			while (*r != ' ' && *r != '\t' && *r != '\n' && *r != '\0');
			*fr++ = 0;
		}
	else
		for (i = 0; ; ) {
			i++;
			if (i >= MAXFLD)
				error(FATAL, "record `%.20s...' has too many fields", record);
			if (!(fldtab[i].tval&FLD))
				xfree(fldtab[i].sval);
			fldtab[i].sval = fr;
			fldtab[i].tval = FLD | STR;
			while (*r != sep && *r != '\n' && *r != '\0')	/* \n always a separator */
				*fr++ = *r++;
			*fr++ = '\0';
			if (*r == 0) break;
			r++;
		}
	*fr = 0;
	for (j=maxfld; j>i; j--) {	/* clean out junk from previous record */
		if (!(fldtab[j].tval&FLD))
			xfree(fldtab[j].sval);
		fldtab[j].tval = STR | FLD;
		fldtab[j].sval = NULL;
	}
	maxfld = i;
	donefld = 1;
	for(i=1; i<=maxfld; i++)
		if(isnumber(fldtab[i].sval))
		{	fldtab[i].fval = atof(fldtab[i].sval);
			fldtab[i].tval |= NUM;
		}
	setfval(lookup("NF", symtab), (awkfloat) maxfld);
	if (dbg)
		for (i = 0; i <= maxfld; i++)
			printf("field %d: |%s|\n", i, fldtab[i].sval);
}

That's all.
