ときどきの雑記帖 RE* (新南口)

And So I'm at a Loss (そして僕は途方に暮れる)

September 4, 2021

Sanseido

It has been a long time since I last went to Jimbocho.

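The Jimbocho main store will close in late March 2022, demolition will begin in April of the same year, and completion of the new building is planned for around 2025 or 2026.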

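I wonder what the new building will be like. I'm also a little curious about what will happen to the shops in the basement and on the second floor.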

Excel

How many years ago was it that writing a program for this was in fashion?

[golang] Computing Excel column letters - Qiita

The logic for computing the column letters is published on Microsoft's official site, albeit as VB code.

I didn't know that, so let's take a look.

How to convert Excel column numbers into alphabetical characters - Office | Microsoft Docs

The machine-translated Japanese version is... well, so here is the English version.

Let iCol be the column number. Stop if iCol is less than 1.

Calculate the quotient and remainder on division of (iCol - 1) by 26, and store in variables a and b.

Convert the integer value of b into the corresponding alphabetical character (0 => A, 25 => Z) and tack it on at the front of the result string.

Set iCol to the divisor a and loop.

The reverse conversion (letters to number) isn't described there.
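The four steps quoted above, plus the reverse direction that the Microsoft page leaves out, could be sketched in C roughly as follows. The function names col_to_letters and letters_to_col are my own, not taken from the Microsoft page or the Qiita article, and the code assumes an ASCII-compatible character set where 'A' to 'Z' are contiguous.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert a 1-based column number into column letters.
 * Writes the result into out (capacity outsz) and returns out, or NULL
 * if icol is less than 1 or the buffer is too small.
 */
char *col_to_letters(long icol, char *out, size_t outsz)
{
    char tmp[32];   /* letters are produced last-to-first */
    size_t n = 0;
    size_t i;

    assert(out != NULL);
    if (icol < 1)
        return NULL;

    while (icol > 0 && n < sizeof tmp) {
        long a = (icol - 1) / 26;   /* quotient */
        long b = (icol - 1) % 26;   /* remainder: 0 => A, 25 => Z */
        tmp[n++] = (char)('A' + b); /* assumes contiguous A..Z */
        icol = a;                   /* continue with the quotient */
    }
    if (n + 1 > outsz)
        return NULL;

    /* Reverse, which has the same effect as prepending each letter. */
    for (i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return out;
}

/*
 * Hypothetical helper for the reverse direction: treat the letters as
 * digits 1..26 in base 26. Returns 0 for an empty string or any character
 * outside A..Z. There is no overflow check for very long strings.
 */
long letters_to_col(const char *s)
{
    long col = 0;

    assert(s != NULL);
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++) {
        if (*s < 'A' || *s > 'Z')
            return 0;
        col = col * 26 + (*s - 'A' + 1);
    }
    return col;
}
```

With a buffer such as char buf[8], col_to_letters(28, buf, sizeof buf) yields "AB", and letters_to_col("XFD") yields 16384. The reverse direction needs no "minus one" trick because each letter is simply a digit in the range 1 to 26.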

awk -v

The Qiita article "A trap lurking in awk's -v option: escaping is required when passing arbitrary values!?" contains the following passage:

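I deliberately say "roughly" because there is a small exception: the specification says that when the value given to the -v option ends with \, it is treated as a literal \ rather than as in BEGIN { bs="\" }. The implementations differed as follows.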

$ gawk -v bs='a\' 'BEGIN{print bs "" }'
a

$ mawk -v bs='a\' 'BEGIN{print bs "" }'
a\

$ original-awk -v bs='a\' 'BEGIN{print bs "" }'
a\

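The phrase "treated as a literal \" seems open to interpretation, but does this mean that gawk does not conform to POSIX on this point? It may simply mean that the backslash does not swallow the following " and so does not cause a syntax error (or, in some cases, a vulnerability). In any case, values do not normally end with \, so this difference is unlikely to have much effect in practice. Incidentally, adding the --posix or --traditional option made no difference.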

Given that, as usual I went off to read the source code (you know the rest).

gawk

First, gawk.

The -v option is handled inside a function called parse_args:

main.c - gawk.git - gawk

case 'v':
        add_preassign(PRE_ASSIGN, optarg);
        break;

It calls a function with the very apt name add_preassign. Here is what that function does:

main.c - gawk.git - gawk

add_preassign(enum assign_type type, char *val)
{
        static long alloc_assigns;              /* for how many are allocated */

#define INIT_SRC 4

        ++numassigns;

        if (preassigns == NULL) {
                emalloc(preassigns, struct pre_assign *,
                        INIT_SRC * sizeof(struct pre_assign), "add_preassign");
                alloc_assigns = INIT_SRC;
        } else if (numassigns >= alloc_assigns) {
                alloc_assigns *= 2;
                erealloc(preassigns, struct pre_assign *,
                        alloc_assigns * sizeof(struct pre_assign), "add_preassigns");
        }
        preassigns[numassigns].type = type;
        preassigns[numassigns].val = estrdup(val, strlen(val));

#undef INIT_SRC
}

It stores type (the first argument to add_preassign) and val (the second argument to add_preassign, that is, the string that came with the -v option) in an array called preassigns.

Searching for the places where add_preassign is used turns up only a handful:

main.c:142:static void add_preassign(enum assign_type type, char *val);
main.c:550:/* add_preassign --- add one element to preassigns */
main.c:553:add_preassign(enum assign_type type, char *val)
main.c:563:                     INIT_SRC * sizeof(struct pre_assign), "add_preassign");
main.c:568:                     alloc_assigns * sizeof(struct pre_assign), "add_preassigns");
main.c:1543:                    add_preassign(PRE_ASSIGN_FS, optarg);
main.c:1570:                    add_preassign(PRE_ASSIGN, optarg);

In practice the only callers are add_preassign(PRE_ASSIGN_FS, optarg); on line 1543 and add_preassign(PRE_ASSIGN, optarg); on line 1570. (Note: the code I'm reading here is from 5.1.0; other versions may differ in some way.)

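The other call to add_preassign, the one we have not looked at yet, turns out to be the handling of the -F option. So the FS (field separator) in PRE_ASSIGN_FS, which is passed as the type, is tied to the -F option.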

main.c - gawk.git - gawk

case 'F':
        add_preassign(PRE_ASSIGN_FS, optarg);
        break;

Now, where and how is the preassigns array that add_preassign fills in actually used?

main.c:409:             if (preassigns[i].type == PRE_ASSIGN)
main.c:410:                     dash_v_errs += (arg_assign(preassigns[i].val, true) == false);
main.c:412:                     cmdline_fs(preassigns[i].val);
main.c:413:             efree(preassigns[i].val);
main.c:570:     preassigns[numassigns].type = type;
main.c:571:     preassigns[numassigns].val = estrdup(val, strlen(val));

That leads to the function arg_assign(char *arg, bool initing):

main.c - gawk.git - gawk

// string assignment

// POSIX disallows any newlines inside strings
// The scanner handles that for program files.
// We have to check here for strings passed to -v.
if (do_posix && strchr(cp, '\n') != NULL)
	fatal(_("POSIX does not allow physical newlines in string values"));

/*
 * BWK awk expands escapes inside assignments.
 * This makes sense, so we do it too.
 * In addition, remove \-<newline> as in scanning.
 */
it = make_str_node(cp, strlen(cp), SCAN | ELIDE_BACK_NL);
it->flags |= USER_INPUT;

Next, make_str_node(const char *s, size_t len, int flags) in node.c:

node.c - gawk.git - gawk

if (c == '\\') {
	c = parse_escape(&pf);
	if (c < 0) {
		if (do_lint)
			lintwarn(_("backslash string continuation is not portable"));
		if ((flags & ELIDE_BACK_NL) != 0)
			continue;
		c = '\\';
	}
	*ptm++ = c;
} else
	*ptm++ = c;

And then parse_escape(const char **string_ptr) in node.c:

node.c - gawk.git - gawk

parse_escape(const char **string_ptr)
{
	int c = *(*string_ptr)++;

The character fetched into c at the top of the function (the character that follows the \) is then handled like this:

node.c - gawk.git - gawk

switch (c) {
case 'a':
	return '\a';
case 'b':
	return '\b';
case 'f':
	return '\f';
case 'n':
	return '\n';
case 'r':
	return '\r';
case 't':
	return '\t';
case 'v':
	return '\v';
case '\n':
	return -2;
case 0:
	(*string_ptr)--;
	return -1;

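That is how it is handled. In our case nothing follows the backslash (c is the \0 that terminates the string), so the function moves the pointer back by one and returns -1. The caller of parse_escape then takes the if (c < 0) branch, and because ELIDE_BACK_NL was passed in the call to make_str_node, the continue is executed and the \ ends up being dropped.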

I can't tell whether this is intentional (it may be documented somewhere in the manual, but I haven't looked).
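To make the difference concrete, here is a deliberately simplified model of the two behaviours seen in the transcript: dropping a trailing backslash (what gawk does) and keeping it (what mawk and original-awk printed). The function expand_value and its elide_trailing_backslash flag are my own invention for illustration, not code from any of these implementations. It only knows \n, \t and \\, and copying unknown sequences through verbatim is a simplification on my part.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model, not taken from gawk or mawk: copy src to dst while
 * expanding a few escape sequences. dst must have room for strlen(src) + 1
 * bytes (the output is never longer than the input).
 * If elide_trailing_backslash is nonzero, a backslash that is the very last
 * character is dropped; otherwise it is kept as a literal backslash.
 * Returns the length of the resulting string.
 */
size_t expand_value(const char *src, char *dst, int elide_trailing_backslash)
{
    char *q = dst;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char c = *src++;

        if (c != '\\') {
            *q++ = c;
            continue;
        }
        switch (*src) {
        case '\0':              /* the backslash was the last character */
            if (!elide_trailing_backslash)
                *q++ = '\\';
            break;              /* src still points at '\0', so the loop ends */
        case 'n':
            *q++ = '\n';
            src++;
            break;
        case 't':
            *q++ = '\t';
            src++;
            break;
        case '\\':
            *q++ = '\\';
            src++;
            break;
        default:                /* simplification: keep unknown sequences as-is */
            *q++ = '\\';
            *q++ = *src++;
            break;
        }
    }
    *q = '\0';
    return (size_t)(q - dst);
}
```

Calling expand_value("a\\", out, 1) leaves "a" in out, matching the gawk output above, while passing 0 as the last argument leaves "a\", matching the other two.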

mawk

rm_escape(char *s, size_t *lenp)

if (*p == '\\')
{
   escape_test[ET_END].in = *++p ; /* sentinal */
   i = 0 ;
   while (escape_test[i].in != *p)  i++ ;

   if (i != ET_END)        /* in table */
   {
      p++ ;
      *q++ = escape_test[i].out ;
   }
   else if (isoctal(*p))
   {
      t = p ;
      *q++ = octal(&t) ;
      p = t ;
   }
   else if (*p == 'x' && ishex(*(unsigned char *) (p + 1)))
   {
      t = p + 1 ;
      *q++ = hex(&t) ;
      p = t ;
   }
   else if (*p == 0)        /* can only happen with command line assign */
      *q++ = '\\' ;
   else  /* not an escape sequence */
   {
      *q++ = '\\' ;
      *q++ = *p++ ;
   }
}

Busybox

mirror/busybox: BusyBox mirror

busybox/awk.c at master · mirror/busybox

opt = getopt32(argv, OPTSTR_AWK, &opt_F, &list_v, &list_f, IF_FEATURE_AWK_GNU_EXTENSIONS(&list_e,) NULL);
argv += optind;
//argc -= optind;
if (opt & OPT_W)
	bb_simple_error_msg("warning: option -W is ignored");
if (opt & OPT_F) {
	unescape_string_in_place(opt_F);
	setvar_s(intvar[FS], opt_F);
}
while (list_v) {
	if (!is_assignment(llist_pop(&list_v)))
		bb_show_usage();
}

busybox/awk.c at master · mirror/busybox

static int is_assignment(const char *expr)
{
	char *exprc, *val;

	val = (char*)endofname(expr);
	if (val == (char*)expr || *val != '=') {
		return FALSE;
	}

	exprc = xstrdup(expr);
	val = exprc + (val - expr);
	*val++ = '\0';

	unescape_string_in_place(val);
	setvar_u(newvar(exprc), val);
	free(exprc);
	return TRUE;
}

busybox/awk.c at master · mirror/busybox

static void unescape_string_in_place(char *s1)
{
	char *s = s1;
	while ((*s1 = nextchar(&s)) != '\0')
		s1++;
}

one true awk

awk/main.c at f9affa922c5e074990a999d486d4bc823590fd93 · onetrueawk/awk

case 'v':       /* -v a=1 to be done NOW.  one -v for each */
        vn = getarg(&argc, &argv, "no variable name");
        if (isclvar(vn))
                setclvar(vn);
        else
                FATAL("invalid -v option argument: %s", vn);
        break;

awk/tran.c at master · onetrueawk/awk

else if (c != '\\')
        *bp++ = c;
else {  /* \something */
        c = *++s;
        if (c == 0) {   /* \ at end */
                *bp++ = '\\';
                break;  /* for loop */
        }

BusyBox's getopt32

busybox/awk.c at master · mirror/busybox

opt = getopt32(argv, OPTSTR_AWK, &opt_F, &list_v, &list_f, IF_FEATURE_AWK_GNU_EXTENSIONS(&list_e,) NULL);

busybox/getopt32.c at master · mirror/busybox

uint32_t FAST_FUNC
getopt32(char **argv, const char *applet_opts, ...)
{
	uint32_t opt;
	va_list p;

	va_start(p, applet_opts);
	opt = vgetopt32(argv, applet_opts, NULL, p);
	va_end(p);
	return opt;
}

busybox/getopt32.c at master · mirror/busybox

vgetopt32(char **argv, const char *applet_opts, const char *applet_long_options, va_list p)
{
	int argc;
	unsigned flags = 0;

one true awk

While looking through onetrueawk/awk: One true awk, I noticed that onetrueawk/awk: One true awk mentions "85 reduce/reduce".

yacc -d awkgram.y
conflicts: 43 shift/reduce, 85 reduce/reduce
mv y.tab.c ytab.c
mv y.tab.h ytab.h
cc -c ytab.c
cc -c b.c
cc -c main.c
cc -c parse.c
cc maketab.c -o maketab
./maketab >proctab.c
cc -c proctab.c
cc -c tran.c
cc -c lib.c
cc -c run.c
cc -c lex.c
cc ytab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o -lm

Never mind the shift/reduce conflicts; was the grammar really one with that many reduce/reduce conflicts?

LC_NUMERIC

In connection with gawk's printf feature for inserting a thousands separator, I found this comment in the source code (I must have seen it before, but I had forgotten about it).

#if defined(LC_NUMERIC)
        /*
         * See comment above about using locale's decimal point.
         *
         * 10/2005:
         * Bitter experience teaches us that most people the world over
         * use period as the decimal point, not whatever their locale
         * uses.  Thus, only use the locale's decimal point if being
         * posixly anal-retentive.
         *
         * 7/2007:
         * Be a little bit kinder. Allow the --use-lc-numeric option
         * to also use the local decimal point. This avoids the draconian
         * strictness of POSIX mode if someone just wants to parse their
         * data using the local decimal point.
         */
        if (use_lc_numeric)
                setlocale(LC_NUMERIC, locale);
#endif

Anything involving locales is a lot of trouble.
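As a small illustration of what switching LC_NUMERIC does to number formatting in C, here is a sketch of my own. The helper format_in_locale is a hypothetical name, not anything from gawk. It sets LC_NUMERIC temporarily, formats a value with snprintf, and restores the previous setting. Which locale names are available depends on the system, so only the "C" locale can be relied on.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format value with two decimals while LC_NUMERIC is
 * set to the locale called name, then restore the previous LC_NUMERIC.
 * Returns 0 on success, or -1 if the locale cannot be selected.
 */
int format_in_locale(const char *name, double value, char *out, size_t outsz)
{
    char saved[128];
    const char *cur;

    assert(name != NULL && out != NULL && outsz > 0);

    /* The string returned by setlocale may be overwritten by a later
     * call, so keep a private copy of the current setting. */
    cur = setlocale(LC_NUMERIC, NULL);
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_NUMERIC, name) == NULL)
        return -1;                      /* e.g. locale not installed */

    snprintf(out, outsz, "%.2f", value);   /* decimal point follows LC_NUMERIC */
    setlocale(LC_NUMERIC, saved);
    return 0;
}
```

With the "C" locale, format_in_locale("C", 1234.5, buf, sizeof buf) leaves "1234.50" in buf. On a system that happens to have a locale such as de_DE.UTF-8 installed, the same call with that name would use a comma as the decimal point, which is exactly the kind of surprise the gawk comment is about.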

egrep/fgrep

Looking at what was released during August in August GNU Spotlight with Mike Gerwitz: 13 new GNU releases!:

13 new GNU releases in the last month (as of August 29, 2021):

diffutils-3.8
gcc-11.2
glibc-2.34
gnunet-0.15.3
gnupg-2.3.2
grep-3.7
help2man-1.48.5
mailutils-3.13
mcron-1.2.1
mtools-4.0.35
mygnuhealth-1.0.4
parallel-20210822
taler-0.8

grep is on the list.

So I looked at grep.git - grep and found this commit:

egrep, fgrep: now obsolete
* NEWS: Mention this (see bug#49996).
* doc/Makefile.am (egrep.1 fgrep.1): Remove. All uses removed.
* doc/grep.in.1, doc/grep.texi (grep Programs):
Remove documentation for egrep, fgrep.
* doc/grep.texi (Usage): Add FAQ for egrep and fgrep.
* src/Makefile.am (shell_does_substrings): Substitute for ${0##*/},
not for ${0%/*} (which was not being used anyway).
* src/egrep.sh: Issue an obsolescence warning.
* tests/fedora: Use “grep -F” instead of “fgrep” in diagnostics,
as this tests “grep -F” not “fgrep”.

diff --git a/NEWS b/NEWS
index 39a0903..4a62fb7 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,11 @@ GNU grep NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Changes in behavior
+
+  The egrep and fgrep commands, which have been deprecated since
+  release 2.5.3 (2007), now warn that they are obsolescent and should
+  be replaced by grep -E and grep -F.
 
 * Noteworthy changes in release 3.7 (2021-08-14) [stable]
 
diff --git a/doc/grep.in.1 b/doc/grep.in.1
index e8854f2..b014f65 100644
--- a/doc/grep.in.1
+++ b/doc/grep.in.1
@@ -137,7 +137,7 @@
 .hy 0
 .
 .SH NAME
-grep, egrep, fgrep \- print lines that match patterns
+grep \- print lines that match patterns
 .
 .SH SYNOPSIS
 .B grep
@@ -184,17 +184,6 @@ If no
 .I FILE
 is given, recursive searches examine the working directory,
 and nonrecursive searches read standard input.
-.PP
-In addition, the variant programs
-.B egrep
-and
-.B fgrep
-are the same as
-.B "grep\ \-E"
-and
-.BR "grep\ \-F" ,
-respectively.
-These variants are deprecated, but are provided for backward compatibility.
 .
 .SH OPTIONS
 .SS "Generic Program Information"
diff --git a/doc/grep.texi b/doc/grep.texi
index 63d2fc9..3236b98 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1159,15 +1159,6 @@ combined with the @option{-z} (@option{--null-data}) option, and note that
 
 @end table
 
-In addition,
-two variant programs @command{egrep} and @command{fgrep} are available.
-@command{egrep} is the same as @samp{grep@ -E}.
-@command{fgrep} is the same as @samp{grep@ -F}.
-Direct invocation as either
-@command{egrep} or @command{fgrep} is deprecated,
-but is provided to allow historical applications
-that rely on them to run unmodified.
-
 
 @node Regular Expressions
 @chapter Regular Expressions
@@ -1918,7 +1909,7 @@ before giving it to @command{grep}, or turn to @command{awk},
 designed to operate across lines.
 
 @item
-What do @command{grep}, @command{fgrep}, and @command{egrep} stand for?
+What do @command{grep}, @option{-E}, and @option{-F} stand for?
 
 The name @command{grep} comes from the way line editing was done on Unix.
 For example,
@@ -1930,8 +1921,29 @@ global/regular expression/print
 g/re/p
 @end example
 
-@command{fgrep} stands for Fixed @command{grep};
-@command{egrep} stands for Extended @command{grep}.
+The @option{-E} option stands for Extended @command{grep}.
+The @option{-F} option stands for Fixed @command{grep};
+
+@item
+What happened to @command{egrep} and @command{fgrep}?
+
+7th Edition Unix had commands @command{egrep} and @command{fgrep}
+that were the counterparts of the modern @samp{grep -E} and @samp{grep -F}.
+Although breaking up @command{grep} into three programs was perhaps
+useful on the small computers of the 1970s, @command{egrep} and
+@command{fgrep} were not standardized by POSIX and are no longer needed.
+In the current GNU implementation, @command{egrep} and @command{fgrep}
+issue a warning and then act like their modern counterparts;
+eventually, they are planned to be removed entirely.
+
+If you prefer the old names, you can use use your own substitutes,
+such as a shell script named @command{egrep} with the following
+contents:
+
+@example
+#!/bin/sh
+exec grep -E "$@@"
+@end example
 
 @end enumerate
 
diff --git a/src/egrep.sh b/src/egrep.sh
index 6d6c15a..a0d1694 100644
--- a/src/egrep.sh
+++ b/src/egrep.sh
@@ -1,2 +1,4 @@
 #!@SHELL@
+cmd=${0##*/}
+echo "$cmd: warning: $cmd is obsolescent; using @grep@ @option@" >&2
 exec @grep@ @option@ "$@"

So the mentions of egrep/fgrep have been removed or rewritten.

GNU grep 2.5.1

This relates to the discussion of test in the previous post.

Old versions of GNU grep changed their behavior depending on whether they were invoked under the name grep or under the name egrep (I don't think it does that anymore); what about test?

That is what I wrote, but going through grep.git - grep version by version shows that this has changed in various ways over time. In 2.5.1:

grep.c\src - grep.git - grep

int
main (int argc, char **argv)
{
  char *keys;
  size_t keycc, oldcc, keyalloc;
  int with_filenames;
  int opt, cc, status;
  int default_context;
  FILE *fp;
  extern char *optarg;
  extern int optind;

  initialize_main (&argc, &argv);
  program_name = argv[0];
  if (program_name && strrchr (program_name, '/'))
    program_name = strrchr (program_name, '/') + 1;

  if (!strcmp(program_name, "egrep"))
    setmatcher ("egrep");
  if (!strcmp(program_name, "fgrep"))
    setmatcher ("fgrep");

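This is about as direct as it gets: it switches on exactly the name it was invoked under. I believe egrep and fgrep were created at install time by some means such as hard links to grep (chasing through configure to check would be a hassle, so I won't).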
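The same name-based switch can be isolated into a small function. This is a sketch of my own, not code from grep. The name option_for_program is hypothetical, and it returns the equivalent option (-E, -F, or -G for plain grep) instead of calling setmatcher.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: given argv[0], strip any leading directory part and
 * return the grep option that corresponds to the name the program was
 * invoked under. Anything other than egrep or fgrep is treated as plain grep.
 */
const char *option_for_program(const char *argv0)
{
    const char *base;

    assert(argv0 != NULL);
    base = strrchr(argv0, '/');
    base = (base != NULL) ? base + 1 : argv0;

    if (strcmp(base, "egrep") == 0)
        return "-E";
    if (strcmp(base, "fgrep") == 0)
        return "-F";
    return "-G";
}
```

For example, option_for_program("/usr/bin/egrep") returns "-E". With hard links, every name runs the same binary and only argv[0] differs, which is why a check like this is enough.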

GNU grep 2.5.3

So what about 2.5.3, the release referred to in "The egrep and fgrep commands, which have been deprecated since release 2.5.3 (2007)"? The files in src - grep.git - grep include egrep.c and fgrep.c alongside grep.c, and their contents are:

#define EGREP_PROGRAM
#include "grep.c"

#define FGREP_PROGRAM
#include "grep.c"

There is also search.c, along with esearch.c and fsearch.c, which presumably correspond to it; their contents are:

#define EGREP_PROGRAM
#include "search.c"

#define FGREP_PROGRAM
#include "search.c"

Following the versions further, by 2.18 and 2.19 these have become shell scripts:

egrep.sh\src - grep.git - grep

#!@SHELL@
grep=grep
case $0 in
  */*)
    dir=${0%/*}
    if test -x "$dir/@grep@"; then
      PATH=$dir:$PATH
      grep=@grep@
    fi;;
esac
exec $grep @option@ "$@"

Like this (the parts enclosed in @ are replaced with the appropriate values at install time).

POSIX

Reading through NEWS - grep.git - grep, I found this entry.

NEWS - grep.git - grep

Version 2.0:

The most important user visible change is that egrep and fgrep have disappeared as separate programs into the single grep program mandated by POSIX 1003.2. New options -G, -E, and -F have been added, selecting grep, egrep, and fgrep behavior respectively. For compatibility with historical practice, hard links named egrep and fgrep are also provided. See the manual page for details.

In addition, the regular expression facilities described in Posix draft 11.2 are now supported, except for internationalization features related to locale-dependent collating sequence information.

There is a new option, -L, which is like -l except it lists files which don’t contain matches. The reason this option was added is because ‘-l -v’ doesn’t do what you expect.

grep.git - grep
