ときどきの雑記帖 RE* (新南口)

The Empire Strikes Back

November 15, 2021

This week's Shizue-san

The origin of the kanji 「鴉」 (crow)

Dreadnought

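I had no time to listen for a while, so several episodes of rebuild.fm had piled up, and I went through them all in one go. That said, hak-san's explanation of the Dreadnought (HMS Dreadnought (1906) - Wikipedia) left me a little unconvinced.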

gsub(//, "@")

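In the Qiita article 正しく理解できる！シェルスクリプトの正規表現（令和最新版）, the section on the meaning of an empty regular expression says:

In probably every awk implementation, specifying an empty regular expression appears to match at character boundaries, as shown below, but it is unclear whether POSIX specifies this (so far I have not found where it is written).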
$ echo "abc" | gawk  '{ gsub(//, "@"); print}'
@a@b@c@

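That is what the article says, but here is what is really going on. The regex is not matching the boundaries between characters. It is matching the empty string at the start of the target, and gsub has an "implicit loop": when a match is empty, it forcibly advances the start point by one character and goes on to the next iteration, which makes it look as if the boundaries were being matched. (Without that step it would never get out of the loop. Well, there is at least one other way to deal with that.)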
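To make that implicit loop concrete, here is a minimal Python sketch. It is not gawk's code: `gsub_sketch` is a hypothetical name, Python's `re` syntax stands in for awk EREs, and the replacement is treated as a literal string (no `&` handling). The extra rule that skips an empty match right after a non-empty one follows the gawk source comment quoted further down in this post.

```python
import re


def gsub_sketch(pattern, repl, text):
    """Replace every match of pattern in text, gsub-style (illustrative only)."""
    regex = re.compile(pattern)
    out = []
    pos = 0                # where the next search starts
    last_nonzero = False   # did the previous replacement consume characters?
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        start, end = m.span()
        if start == end and last_nonzero and start == pos:
            # Empty match immediately after a real replacement: skip it.
            last_nonzero = False
        else:
            out.append(text[pos:start])  # unmatched text before the match
            out.append(repl)
            last_nonzero = end > start
        if start == end:
            if end == len(text):
                pos = end
                break
            # Empty match: copy one character and step past it,
            # otherwise the loop would never advance.
            out.append(text[end])
            end += 1
        pos = end
    out.append(text[pos:])  # whatever is left after the last match
    return "".join(out)


print(gsub_sketch("", "@", "abc"))    # @a@b@c@
print(gsub_sketch("b*", "X", "abc"))  # XaXcX
```

The one-character step in the empty-match branch is what produces the "matches every boundary" appearance.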

So, looking at the gawk source, there is this comment.

/* do_sub --- do the work for sub, gsub, and gensub */

/*
 * Gsub can be tricksy; particularly when handling the case of null strings.
 * The following awk code was useful in debugging problems.  It is too bad
 * that it does not readily translate directly into the C code, below.
 *
 * #! /usr/local/bin/mawk -f
 *
 * BEGIN {
 * 	true = 1; false = 0
 * 	print "--->", mygsub("abc", "b+", "FOO")
 * 	print "--->", mygsub("abc", "x*", "X")
 * 	print "--->", mygsub("abc", "b*", "X")
 * 	print "--->", mygsub("abc", "c", "X")
 * 	print "--->", mygsub("abc", "c+", "X")
 * 	print "--->", mygsub("abc", "x*$", "X")
 * }
 *
 * function mygsub(str, regex, replace,	origstr, newstr, eosflag, nonzeroflag)
 * {
 * 	origstr = str;
 * 	eosflag = nonzeroflag = false
 * 	while (match(str, regex)) {
 * 		if (RLENGTH > 0) {	# easy case
 * 			nonzeroflag = true
 * 			if (RSTART == 1) {	# match at front of string
 * 				newstr = newstr replace
 * 			} else {
 * 				newstr = newstr substr(str, 1, RSTART-1) replace
 * 			}
 * 			str = substr(str, RSTART+RLENGTH)
 * 		} else if (nonzeroflag) {
 * 			# last match was non-zero in length, and at the
 * 			# current character, we get a zero length match,
 * 			# which we don't really want, so skip over it
 * 			newstr = newstr substr(str, 1, 1)
 * 			str = substr(str, 2)
 * 			nonzeroflag = false
 * 		} else {
 * 			# 0-length match
 * 			if (RSTART == 1) {
 * 				newstr = newstr replace substr(str, 1, 1)
 * 				str = substr(str, 2)
 * 			} else {
 * 				return newstr str replace
 * 			}
 * 		}
 * 		if (length(str) == 0)
 * 			if (eosflag)
 * 				break
 * 			else
 * 				eosflag = true
 * 	}
 * 	if (length(str) > 0)
 * 		newstr = newstr str	# rest of string
 *
 * 	return newstr
 * }
 */

The C part does the same (as usual, I have stripped the leading indentation):

if (global || current == how_many) {
	/*
	 * If the current match matched the null string,
	 * and the last match didn't and did a replacement,
	 * and the match of the null string is at the front of
	 * the text (meaning right after end of the previous
	 * replacement), then skip this one.
	 */
	if (matchstart == matchend
	    && lastmatchnonzero
	    && matchstart == text) {
		lastmatchnonzero = false;
		matches--;
		goto empty;
	}

and

empty:
	/* catch the case of gsub(//, "blah", whatever), i.e. empty regexp */
	if (matchstart == matchend && matchend < text + textlen) {
		*bp++ = *matchend;
		matchend++;
	}
	textlen = text + textlen - matchend;
	text = matchend;

That is how it is handled (though these excerpts alone probably carry too little information to follow). The field-splitting code also has this function:

/*
 * re_parse_field --- parse fields using a regexp.
 *
 * This is called both from get_field() and from do_split()
 * via (*parse_field)().  This variation is for when FS is a regular
 * expression -- either user-defined or because RS=="" and FS==" "
 */
static long
re_parse_field(long up_to,	/* parse only up to this field number */
	char **buf,	/* on input: string to parse; on output: point to start next */
	int len,
	NODE *fs ATTRIBUTE_UNUSED,
	Regexp *rp,
	Setfunc set,	/* routine to set the value of the parsed field */
	NODE *n,
	NODE *sep_arr,  /* array of field separators (maybe NULL) */
	bool in_middle)
{

which contains this loop:

while (scan < end
       && research(rp, scan, 0, (end - scan), regex_flags) != -1
       && nf < up_to) {
	regex_flags |= RE_NO_BOL;
	if (REEND(rp, scan) == RESTART(rp, scan)) {   /* null match */
		if (gawk_mb_cur_max > 1)	{
			mbclen = mbrlen(scan, end-scan, &mbs);
			if ((mbclen == 1) || (mbclen == (size_t) -1)
				|| (mbclen == (size_t) -2) || (mbclen == 0)) {
				/* We treat it as a singlebyte character.  */
				mbclen = 1;
			}
			scan += mbclen;
		} else
			scan++;
		if (scan == end) {
			(*set)(++nf, field, (long)(scan - field), n);
			up_to = nf;
			break;
		}
		continue;
	}

This is how it copes with splitting on the empty string.

Single Character Fields

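Field splitting and split with the empty string are not covered in awk - pattern scanning and processing language either, as far as I can see, but the gawk documentation mentions them in several places, like this.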

@node Single Character Fields
@subsection Making Each Character a Separate Field

@cindex common extensions @subentry single character fields
@cindex extensions @subentry common @subentry single character fields
@cindex differences in @command{awk} and @command{gawk} @subentry single-character fields
@cindex single-character fields
@cindex fields @subentry single-character
There are times when you may want to examine each character
of a record separately.  This can be done in @command{gawk} by
simply assigning the null string (@code{""}) to @code{FS}. @value{COMMONEXT}
In this case,
each individual character in the record becomes a separate field.
For example:

@example
$ @kbd{echo a b | gawk 'BEGIN @{ FS = "" @}}
>                  @kbd{@{}
>                      @kbd{for (i = 1; i <= NF; i = i + 1)}
>                          @kbd{print "Field", i, "is", $i}
>                  @kbd{@}'}
@print{} Field 1 is a
@print{} Field 2 is
@print{} Field 3 is b
@end example

@cindex dark corner @subentry @code{FS} as null string
@cindex @code{FS} variable @subentry null string as
Traditionally, the behavior of @code{FS} equal to @code{""} was not defined.
In this case, most versions of Unix @command{awk} simply treat the entire record
as only having one field.
@value{DARKCORNER}
In compatibility mode
(@pxref{Options}),
if @code{FS} is the null string, then @command{gawk} also
behaves this way.

@node Field Splitting Summary
@subsection Field-Splitting Summary


(omitted)

@item FS == ""
Each individual character in the record becomes a separate field.
(This is a common extension; it is not specified by the POSIX standard.)
@end table

@cindex @code{FS} variable
@cindex separators @subentry field
@cindex field separator
@item FS
The input field separator (@pxref{Field Separators}).
The value is a single-character string or a multicharacter regular
expression that matches the separations between fields in an input
record.  If the value is the null string (@code{""}), then each
character in the record becomes a separate field.
(This behavior is a @command{gawk} extension. POSIX @command{awk} does not
specify the behavior when @code{FS} is the null string.
Nonetheless, some other versions of @command{awk} also treat
@code{""} specially.)

it is not specified by the POSIX standard

and

POSIX @command{awk} does not specify the behavior when @code{FS} is the null string.

is what the documentation says. And here is the passage on split.

@item @code{split(@var{string}, @var{array}} [@code{, @var{fieldsep}} [@code{, @var{seps}} ] ]@code{)}
@cindexawkfunc{split}

(omitted)

@cindex differences in @command{awk} and @command{gawk} @subentry @code{split()} function
As with input field-splitting, when the value of @var{fieldsep} is
@w{@code{" "}}, leading and trailing whitespace is ignored in values assigned to
the elements of
@var{array} but not in @var{seps}, and the elements
are separated by runs of whitespace.
Also, as with input field splitting, if @var{fieldsep} is the null string, each
individual character in the string is split into its own array element.
@value{COMMONEXT}
Additionally, if @var{fieldsep} is a single-character string, that string acts
as the separator, even if its value is a regular expression metacharacter.
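The split() rules in that excerpt can be summarized in a small Python sketch. `awk_like_split` is a hypothetical helper, not a gawk API. Python's `re` syntax stands in for awk regular expressions, whitespace is assumed to mean spaces, tabs and newlines, and the `seps` array is not modeled.

```python
import re


def awk_like_split(string, fieldsep):
    """Split string the way the quoted gawk text describes (illustrative only)."""
    if string == "":
        return []
    if fieldsep == " ":
        # Runs of whitespace separate elements; leading and trailing
        # whitespace is ignored.
        return re.findall(r"[^ \t\n]+", string)
    if fieldsep == "":
        # Null string: every character becomes its own element.
        return list(string)
    if len(fieldsep) == 1:
        # A single character is taken literally, even a regex metacharacter.
        return string.split(fieldsep)
    # Anything longer is treated as a regular expression.
    return re.split(fieldsep, string)


print(awk_like_split("a b", ""))     # ['a', ' ', 'b']
print(awk_like_split("a.b.c", "."))  # ['a', 'b', 'c']
```

The second call shows the last sentence of the excerpt: "." splits on a literal dot rather than acting as "any character".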

Nitpicking

By the way, in the same article's section "Do not use the empty regular expressions () or (exp1|)", there is this:

A <vertical-line> appearing first or last in an ERE, or immediately following a <vertical-line> or a <left-parenthesis>, or immediately preceding a <right-parenthesis>, produces undefined results.

Translation: when | appears first or last in an ERE, or in the case of (| or |), the result is undefined

It looks like "or immediately following a <vertical-line>" has been dropped from the translation.
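As a rough illustration of the four cases in the POSIX sentence, including the one the translation drops, here is a Python sketch that reports where a `|` falls into undefined territory. `undefined_bar_positions` is a hypothetical helper, and its handling of escapes and bracket expressions is simplified, so treat it as a reading aid rather than an ERE parser.

```python
def undefined_bar_positions(ere):
    """Return the indexes of '|' characters that POSIX leaves undefined."""
    tokens = []  # (kind, index); kind is '|', '(', ')' or 'x' for anything else
    i, n = 0, len(ere)
    while i < n:
        c = ere[i]
        if c == "\\":
            tokens.append(("x", i))  # an escaped character is ordinary
            i += 2
        elif c == "[":
            # Skip a bracket expression; '|' inside it is a literal.
            j = i + 1
            if j < n and ere[j] == "^":
                j += 1
            if j < n and ere[j] == "]":
                j += 1  # a leading ']' is a literal
            while j < n and ere[j] != "]":
                if ere[j] == "[" and j + 1 < n and ere[j + 1] in ".:=":
                    close = ere.find(ere[j + 1] + "]", j + 2)
                    j = close + 2 if close != -1 else n
                else:
                    j += 1
            tokens.append(("x", i))
            i = j + 1
        else:
            tokens.append((c if c in "|()" else "x", i))
            i += 1

    bad = []
    for k, (kind, idx) in enumerate(tokens):
        if kind != "|":
            continue
        prev = tokens[k - 1][0] if k > 0 else None
        nxt = tokens[k + 1][0] if k + 1 < len(tokens) else None
        # First or last, right after '|' or '(', or right before ')'.
        if prev is None or nxt is None or prev in ("|", "(") or nxt == ")":
            bad.append(idx)
    return bad


print(undefined_bar_positions("a||b"))  # [2]
print(undefined_bar_positions("(a|)"))  # [2]
```

The first call is the "immediately following a <vertical-line>" case.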

sleep

GNU CoreUtils sleep accepts infinity

The GNU CoreUtils sleep command

$ sleep 3s

I knew it accepted forms like this, but

$ sleep infinity

I did not know this one waits forever.
— mattn (@mattn_jp) November 12, 2021

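If that was confirmed by reading the source code, then that is presumably how it behaves. Still, I wondered whether the authors might be using strtod without remembering that part of its specification (in other words, whether this is unintended behavior, a bug), so I checked the manual: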

25.1 sleep: Delay for a specified time - GNU Coreutils 9.0

Also one could sleep indefinitely like:
sleep inf

I see.

On Linux, giving sleep(1) a large value makes it nanosleep(3) 24 days at a time

gnulib/nanosleep.c at dd0af10fa597a95ffe5f4f110ef5edefc2f680bc · coreutils/gnulib

cygwin 1.5.x, which can’t sleep more than 49.7 days (2^32 milliseconds).

49.7 days. A number I feel I have seen somewhere before 😄
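A quick Python sketch of the numbers involved. The first part only checks the arithmetic behind "49.7 days (2^32 milliseconds)". The second part is a hypothetical `sleep_chunks` generator showing the general idea of breaking a long (or infinite) sleep into pieces of at most 24 days, with the limit taken from the article title above. It is not the gnulib code.

```python
from itertools import islice

MS_PER_DAY = 24 * 60 * 60 * 1000
LIMIT_SECONDS = 24 * 24 * 60 * 60  # 24 days, per the article title


def sleep_chunks(seconds, limit=LIMIT_SECONDS):
    """Yield successive sleep lengths, each at most limit seconds."""
    while seconds > 0:
        step = min(seconds, limit)
        yield step
        seconds -= step  # inf minus a finite step stays inf


print(f"{2**32 / MS_PER_DAY:.1f} days")         # 49.7 days
print(list(sleep_chunks(50 * 24 * 60 * 60)))    # [2073600, 2073600, 172800]

# float() accepts "inf" and "infinity", much like strtod does, and an
# infinite duration simply never runs out of chunks.
forever = float("infinity")
print(list(islice(sleep_chunks(forever), 3)))   # [2073600, 2073600, 2073600]
```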

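Hugo notes

I would like to check how the site looks on a smartphone or tablet as well, but being able to reach it only in the production environment is a nuisance. I had been fretting over that, but it seems the options given when starting the Hugo server can take care of it.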

> hugo server --help
Hugo provides its own webserver which builds and serves the site.
While hugo server is high performance, it is a webserver with limited options.
Many run it in production, but the standard behavior is for people to use it
in development and use a more full featured server such as Nginx or Caddy.

'hugo server' will avoid writing the rendered and served content to disk,
preferring to store it in memory.

By default hugo will also watch your files for any changes you make and
automatically rebuild the site. It will then live reload any open browser pages
and push the latest content to them. As most Hugo sites are built in a fraction
of a second, you will be able to save and see your changes nearly instantly.

Usage:
  hugo server [flags]

Aliases:
  server, serve

Flags:
      --appendPort             append port to baseURL (default true)
  -b, --baseURL string         hostname (and path) to the root, e.g. http://spf13.com/
      --bind string            interface to which the server will bind (default "127.0.0.1")
  -D, --buildDrafts            include content marked as draft
(rest omitted)

I see, so --baseURL and --bind are the ones to use.
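A small Python sketch of how those two flags might be combined. Everything specific here is an assumption: `hugo_server_command` is a hypothetical helper, 192.168.1.10 is a made-up LAN address for the development machine, and binding to 0.0.0.0 (all interfaces) is one possible choice, not necessarily the right one for every network. Only the --bind and --baseURL flags come from the help text above.

```python
def hugo_server_command(lan_ip):
    """Build a 'hugo server' command line reachable from other devices on the LAN."""
    return [
        "hugo", "server",
        "--bind", "0.0.0.0",                # listen on all interfaces instead of 127.0.0.1
        "--baseURL", f"http://{lan_ip}/",   # generated links point at the LAN address
    ]


# 192.168.1.10 is a placeholder; substitute the machine's real address.
print(" ".join(hugo_server_command("192.168.1.10")))
```

Because --appendPort defaults to true, the server's port is appended to the base URL, so the phone or tablet would open that address with the port the server reports on startup.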
