ときどきの雑記帖 RE* (新南口)

変更履歴

June 21, 2020

grep の -l オプション (一覧表示) と -v (条件反転) オプションを併用すると死ぬ - Qiita という記事に

おそらく grep 作者としては

「-vl の動きは直感的じゃないから直したいけど破壊的変更になっちゃうから -L 新しく作っておくか〜」

といったところでしょう。歴史は特に調べてないので知りませんが…
（もし興味ある方がいらっしゃいましたら調べてコメントしていただけると嬉しいです）

とあったのが(「そりゃちがうだろう」という方向で)気になったのでちょっと調べた。

まずはドキュメントから

「仕様バグ」ってことはないと思うけど、「-l -v を組み合わせたときの動作を勘違いする人が少なからずいるので GNU grep では -L を追加した」という話を聞いた覚えがあったので記憶をたどりながらちょっと調べてみた。

まずはGNU grep のソースのtar玉に同梱のドキュメント (grep.texi\doc - grep.git - grep) を見ると FAQ にこんな記述があった。

Why doesn’t @samp{grep -lv} print non-matching file names?

@samp{grep -lv} lists the names of all files containing one or more lines that do not match. To list the names of all files that contain no matching lines, use the @option{-L} or @option{–files-without-match} option.

FAQにあるということは、まあそういうことなんでしょうねえ。

そもそも -v とか -l ってどんなオプションだっけ?

POSIX 的には -L はないけど -l と -v はあって (grep)、それぞれ

-l
(The letter ell.) Write only the names of files containing selected lines to standard output. Pathnames shall be written once per file searched.

-v
Select lines not matching any of the specified patterns. If the -v option is not specified, selected lines shall be those that match any of the specified patterns.

先の記事にあった数式はよくわからんけど、-v は与えたパターンにマッチしなかった行を出力する -l は出力する内容を条件を満たした行からファイルの名前にすりかえる(ただし出力するのは一回だけ) ということでそれは現状の動作と問題なく一致するし、やはり「仕様バグ」にはならないんじゃないか?。

もちろんそれと、現行の -L で行うようなオペレーションのニーズがあり、それが -l -v の組み合わせでは (できそうな気もするけど)できない。というのは仕様云々とは別の話。

-L が追加されたのはいつのこと?

いつごろ－L が追加されたのかを同じくGNU grep のtar玉のNEWS から

Version 2.0 のところに

There is a new option, -L, which is like -l except it lists files which don’t contain matches. The reason this option was added is because ‘-l -v’ doesn’t do what you expect.

というのがあったので 2.0 で確定ということでいいだろうけどこれのリリースがいつごろだったのはNEWSには記載がなかった (tar玉そのものの日付でいいのかという話もあるけど Index of /gnu/grep にあるものでは1996-10-01になっている)。

ChangeLog (ChangeLog-2009 - grep.git - grep) の方も確認すると -L に言及しているエントリーはいくつかあったけど一番日付が古かったのはこれ。ただ、これは処理を改良したということであって実装したというものではない。

1997-07-17 Alain Magloire
* speed hack for -l, -L: bail out on first match.
From Scott Weikart.

(2.0のtar玉の日付よりあとじゃん…)

ここでソースそのものから情報を得るのをあきらめて、 GNU’s Bulletin (日本語版はGNUダイジェスト)という過去FSFが発行していたものをみると

1993年1月発行の GNU’s Bulletin, vol. 1 no. 14, January, 1993 - GNU Project - Free Software Foundation では

grep/egrep 1.6 and fgrep 1.1 The [ef]grep programs are GNU’s versions of the Unix programs of the same name. They are much faster than the traditional Unix versions.

おなじく1993年6月発行の GNU’s Bulletin, vol. 1 no. 15, June, 1993 - GNU Project - Free Software Foundation では

grep/egrep/fgrep 2.0 The [ef]grep programs are GNU’s versions of the Unix programs of the same name. They are much faster than the traditional Unix versions.

というのが見つかった。二つの号の発行時期を考えると、2.0のリリースは 1993年前半のどこかということだろう。当時のプログラミング雑誌やNews Groupを追いかければもっと細かく追いかけられるだろうけど面倒なのでそこまではやらない。

6/22追記

ChangeLog にちゃんと書いてあった。

Sat May 22 22:59:48 1993 Mike Haertel (mike@cs.uoregon.edu)
  * Version 2.0 released.

カレントのChangeLog-2009だとだとここ(ChangeLog-2009 - grep.git - grep)

コードをちょっと覗いてみる

2.0の正確なリリース時期はさておき、 GNU grep のたどれる最古の2.0と最近の3.3とでコマンドラインオプション解析の部分を見てみるとこんな感じ。

2.0

      case 'L':
        /* Like -l, except list files that don't contain matches.
           Inspired by the same option in Hume's gre. */
        out_quiet = 1;
        list_files = -1;
        break;
      case 'l':
        out_quiet = 1;
        list_files = 1;
        break;
...
      case 'v':
        out_invert = 1;
        break;

3.3

      case 'L':
        /* Like -l, except list files that don't contain matches.
           Inspired by the same option in Hume's gre. */
        list_files = LISTFILES_NONMATCHING;
        break;

      case 'l':
        list_files = LISTFILES_MATCHING;
        break;
...
      case 'v':
        out_invert = 1;
        break;

3.3ではフラグに代入している値の意味がわかりやすくなってますな。もっともここではフラグ変数の設定をしているだけで実際の処理は別のところでやっている (のでこれだけでは -l と -v を独立に処理しているとは言い切れない) のだけどここではそこまで踏み込まない。

2.0より前は

2.0 より前のソースコードはGNUのftpサイトにも置かれていない。 2.0 になる直前くらいのChangeLogがあれば -L オプション導入に関しての情報がもう少し得られると思ったんだけどないものはしかたがない (し、ほかで探すのは面倒な)ので、先に見たコマンドラインオプション解析部分のコメントにあった Inspired by the same option in Hume’s gre. からあたってみることにした。

Andrew Hume

結論から書くと、Hume が人の名前(Andrew Hume)で、gre は彼(単独ではないかもしれないけど)が書いた grep のようなツールの名前。特にツールの名前はググりにくくて苦労したけどこの Humeさんは grep に関係するところで意外に見かける人だった。思いがけないところではProgramming Perl にも登場していた。「有名なUNIX哲学者」らしいのだけど、Wikipediaに独立した項目は立てられていないみたい。

Humeさんとgrepの関りはたとえば grep - Wikipedia, the free encyclopedia では

Hume, Andrew A tale of two greps, Software - Practice and Experience 18, (11), 1063-1072 (1988).
Hume, Andrew Grep wars: The strategic search initiative. In Peter Collinson, editor, Proceedings of the EUUG Spring 88 Conference, pages 237?245, Buntingford, UK, 1988. European UNIX User Group.

という著作が挙げられている。前者は検索すればすぐに見つけられる (A tale of two greps - Hume - 1988 - Software: Practice and Experience - Wiley Online Library や A tale of two greps | Software?Practice & Experience など)のだけど、残念ながら「paywall」の向こう側でした。

また、10年ほど前(もうそんなに経ってたのか!)に話題になった GNU grepが高速な理由 | マイナビニュースやインテル・AMDのCPUアーキテクトが明かす： GNU grep が速い理由 - karasuyamatenguの日記の話題でも名前が出ていた。

why GNU grep is fast

See “Fast String Searching”, by Andrew Hume and Daniel Sunday,
in the November 1991 issue of Software Practice & Experience, for
a good discussion of Boyer-Moore implementation tricks. It’s
available as a free PDF online.

これね。こっちの論文はフリーにアクセスできるところにあるので興味がある人はどうぞ。

gre

さて作った人の方はこれくらいにして、ツールの方を。

GNU grep のコメントにあったものと同じという直接の証拠は残念ながらないのだけど、gre は gre page from Section 1 of the unix 10th manual で間違いないと思う。 Hume さん UNIX 10th に関わっていたのは確かなようだし。見つかった gre のマニュアルページはこんな感じだった。

     GRE(1)                                                     GRE(1)

     NAME
          gre, grep, egrep, fgrep - search a file for a pattern

     SYNOPSIS
          gre [ option ... ] pattern [ file ... ]

          grep [ option ... ] pattern [ file ... ]

          egrep [ option ... ] pattern [ file ... ]

          fgrep [ option ... ] strings [ file ... ]

     DESCRIPTION
          Gre searches the input files (standard input default) for
          lines (with newlines excluded) that match the pattern, a
          regular expression as defined in re(3). A file name of - is
          interpreted as standard input.  Normally, each line matching
          the pattern is `selected', and each selected line is copied
          to the standard output.  The options are

          -1   Print only the first selected line of each file argu-
               ment.
          -b   Mark each printed line with its byte position in its
               file.  This is sometimes useful in locating patterns in
               non-text files.
          -c   Print only a count of matching lines.
          -e pattern
               Same as a simple pattern argument, but useful when
               pattern begins with a -.
          -E   Simulate egrep.
          -f file
               Read the pattern from file; there is no pattern argu-
               ment
          -F   Simulate fgrep.
          -G   Simulate grep.
          -h   Do not print filename tags (headers) with output lines.
          -i   Ignore alphabetic case distinctions.
          -l   Print the names of files with selected lines; don't
               print the lines.
          -L   Print the names of files with no selected lines; the
               converse of -l.
          -n   Mark each printed line with its line number counted in
               its file.
          -s   Produce no output, but return status.
          -v   Reverse: print lines that do not match the pattern.
          -x   Exact match: The pattern is ^(pattern)$.  The implicit
               parentheses count in back references.

void
usage (int status)
{
  if (status != 0)
    fprintf (stderr, _("Try `%s --help' for more information.\n"),
	     program_name);
  else
    {
      printf (_("\
Usage: %s [OPTION]... SET1 [SET2]\n\
"),
	      program_name);
      fputs (_("\
Translate, squeeze, and/or delete characters from standard input,\n\
writing to standard output.\n\
\n\
  -c, --complement        first complement SET1\n\
  -d, --delete            delete characters in SET1, do not translate\n\
  -s, --squeeze-repeats   replace each input sequence of a repeated character\n\
                            that is listed in SET1 with a single occurrence\n\
                            of that character\n\
  -t, --truncate-set1     first truncate SET1 to length of SET2\n\
"), stdout);
      fputs (HELP_OPTION_DESCRIPTION, stdout);
      fputs (VERSION_OPTION_DESCRIPTION, stdout);
      fputs (_("\
\n\
SETs are specified as strings of characters.  Most represent themselves.\n\
Interpreted sequences are:\n\
\n\
  \\NNN            character with octal value NNN (1 to 3 octal digits)\n\
  \\\\              backslash\n\
  \\a              audible BEL\n\
  \\b              backspace\n\
  \\f              form feed\n\
  \\n              new line\n\
  \\r              return\n\
  \\t              horizontal tab\n\
"), stdout);
     fputs (_("\
  \\v              vertical tab\n\
  CHAR1-CHAR2     all characters from CHAR1 to CHAR2 in ascending order\n\
  [CHAR*]         in SET2, copies of CHAR until length of SET1\n\
  [CHAR*REPEAT]   REPEAT copies of CHAR, REPEAT octal if starting with 0\n\
  [:alnum:]       all letters and digits\n\
  [:alpha:]       all letters\n\
  [:blank:]       all horizontal whitespace\n\
  [:cntrl:]       all control characters\n\
  [:digit:]       all digits\n\
"), stdout);
     fputs (_("\
  [:graph:]       all printable characters, not including space\n\
  [:lower:]       all lower case letters\n\
  [:print:]       all printable characters, including space\n\
  [:punct:]       all punctuation characters\n\
  [:space:]       all horizontal or vertical whitespace\n\
  [:upper:]       all upper case letters\n\
  [:xdigit:]      all hexadecimal digits\n\
  [=CHAR=]        all characters which are equivalent to CHAR\n\
"), stdout);
     fputs (_("\
\n\
Translation occurs if -d is not given and both SET1 and SET2 appear.\n\
-t may be used only when translating.  SET2 is extended to length of\n\
SET1 by repeating its last character as necessary.  \
"), stdout);
     fputs (_("\
Excess characters\n\
of SET2 are ignored.  Only [:lower:] and [:upper:] are guaranteed to\n\
expand in ascending order; used in SET2 while translating, they may\n\
only be used in pairs to specify case conversion.  \
"), stdout);
     fputs (_("\
-s uses SET1 if not\n\
translating nor deleting; else squeezing uses SET2 and occurs after\n\
translation or deletion.\n\
"), stdout);
      printf (_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
    }
  exit (status == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
}

7th UNIX やそれ以前のUNIXのソースコードが自由にみられるようになっていたのは知っていたのだけど、8th, 9th, 10th もあったんですな。ということでありがたく 10th のところで検索するとありました gre のソース。

greを構成する複数あるファイルの中から V10/cmd/gre/gre.reply を見ると

The following is a summary of the somewhat plausible ideas
suggested for the new grep. I thank leo de witt particularly and others
for clearing up misconceptions and pointing out (correctly) that
existing tools like sed already do (or at least nearly do) what some people
asked for. The following points are in no particular order and no slight is
intended by my presentation.

-lv (files that don’t have any matching lines)
-lv means print names of files that have any nonmatching lines
(useful, say, for checking input syntax). -L will mean print
names of files without selected lines.

とあるので、-l -v を組み合わせたときの動作は最初からこのように意図されていたものと考えていいんじゃないですかね。

過去の grepから

gre と grep の両方があるV10を含め過去に遡っていくと、 -l と -v の二つがそろったのはV7の時点だった (V6には-vだけ)。以下それぞれのバージョンのgrepへのリンク。興味のある向きは比較してみてちょーだい。

-l が追加された V7 のソースを見るとこんな感じ(多少の変化はあるけどV10に至るまで基本的な流れは同じ)。個々の処理対象ファイルは execute という関数で扱っていて、

execute(file)
char *file;
{
	register char *p1, *p2;
	register c;

	if (file) {
		if (freopen(file, "r", stdin) == NULL)
			errexit("grep: can't open %s\n", file);
	}
	lnum = 0;
	tln = 0;
	for (;;) {
		lnum++;
		p1 = linebuf;
		while ((c = getchar()) != '\n') {
			if (c == EOF) {
				if (cflag) {
					if (nfile>1)
						printf("%s:", file);
					printf("%D\n", tln);
				}
				return;
			}
			*p1++ = c;
			if (p1 >= &linebuf[LBSIZE-1])
				break;
		}
		*p1++ = '\0';
		p1 = linebuf;
		p2 = expbuf;
		if (circf) {
			if (advance(p1, p2))
				goto found;
			goto nfound;
		}
		/* fast check for first character */
		if (*p2==CCHR) {
			c = p2[1];
			do {
				if (*p1!=c)
					continue;
				if (advance(p1, p2))
					goto found;
			} while (*p1++);
			goto nfound;
		}
		/* regular algorithm */
		do {
			if (advance(p1, p2))
				goto found;
		} while (*p1++);
	nfound:
		if (vflag)
			succeed(file);
		continue;
	found:
		if (vflag==0)
			succeed(file);
	}
}

通常時に与えたパターンにマッチした場合や -v 指定時に与えたパターンとマッチしなかったときには、 succeed という関数が呼び出されている。


succeed(f)
char *f;
{
	long ftell();
	nsucc = 1;
	if (sflag)
		return;
	if (cflag) {
		tln++;
		return;
	}
	if (lflag) {
		printf("%s\n", f);
		fseek(stdin, 0l, 2);
		return;
	}
	if (nfile > 1 && hflag)
		printf("%s:", f);
	if (bflag)
		printf("%ld:", (ftell(stdin)-1)/BSIZE);
	if (nflag)
		printf("%ld:", lnum);
	printf("%s\n", linebuf);
}

関数 succeed ではフラグ設定に応じて色々やっているけど -l 指定時はファイル名を出力している。

結論

まあそういうことで。

≪ prev POSIX.STD

next ≫ 小ネタ