ときどきの雑記帖″ 2014年3月(上旬)

2014年03月09日

■_

arton さんの目黒の話を読んで、目黒駅と目黒川ってそんなに近くにあったっけと思ったら自分の記憶違いだった。山手通りの向こう側(目黒駅から見て)と思ってたわー。たぶん246近辺の記憶に引きずられてたんだろう＞位置関係

ビニールポットに植えられてるカタクリが近くの花屋(というか種苗店)で売られているのを発見。何回かチャレンジしたけど、翌年、翌々年くらいまでしか咲かせられなかったんだよなあ。買っても。＞カタクリ

■_ COBOL本

Amazon.co.jp： Beginning Cobol for Programmers: Michael Coughlan: 洋書ちょっと気になる。

Amazon.co.jp： Beginning Cobol for Programmers: Michael Coughlan: 洋書

Beginning COBOL for Programmers is a comprehensive, sophisticated tutorial and modular skills reference on the
COBOL programming language for established programmers. This book is for you if you are a developer who would
like to - or must - add COBOL to your repertoire. Perhaps you recognize the opportunities presented by the
current COBOL skills crisis, or you may be working in a mission critical enterprise which retains legacy COBOL
applications. Whatever your situation, Beginning COBOL for Programmers meets your needs as an established
programmer moving to COBOL.

で、 grep.git - grep 改めて Jim さんの 1月10日パッチを見ると、本当に trivial_ignore_case に関連するところだけですね。関数本体と、それに付随するマクロ、関数を呼び出しているところ。で、trivial_ignore_case を呼び出したところで match_icase を 0 にしているのだからこの変数を参照しているところが怪しいということでざっと grep してみると…

dfasearch.c:82:  const char *buf = (match_icase && MB_CUR_MAX > 1
dfasearch.c:135:  if (match_icase)
dfasearch.c:138:  dfasyntax (syntax_bits, match_icase, eolbyte);
dfasearch.c:225:      if (match_icase)
kwsearch.c:38:  char const *pat = (match_icase && MB_CUR_MAX > 1
kwsearch.c:91:      if (match_icase)
main.c:353:int match_icase;
main.c:1075:  if (MB_CUR_MAX == 1 || !match_icase)
main.c:2050:        match_icase = 1;
pcresearch.c:57:  int flags = PCRE_MULTILINE | (match_icase ? PCRE_CASELESS : 0);
searchutils.c:31:  if (match_icase && MB_CUR_MAX == 1)
grep.h:42:extern int match_icase;		/* -i */

main.c の 353 行目は宣言で、2050行目のはオプション処理で設定しているところでしょうね。そして 1075行目付近を見ると

static size_t
do_execute (char const *buf, size_t size, size_t *match_size,
            char const *start_ptr)
{
  size_t result;
  const char *line_next;

  /* With the current implementation, using --ignore-case with a multi-byte
     character set is very inefficient when applied to a large buffer
     containing many matches.  We can avoid much of the wasted effort
     by matching line-by-line.

     FIXME: this is just an ugly workaround, and it doesn't really
     belong here.  Also, PCRE is always using this same per-line
     matching algorithm.  Either we fix -i, or we should refactor
     this code---for example, we could add another function pointer
     to struct matcher to split the buffer passed to execute.  It would
     perform the memchr if line-by-line matching is necessary, or just
     return buf + size otherwise.  */
  if (MB_CUR_MAX == 1 || !match_icase)
    return execute (buf, size, match_size, start_ptr);

  for (line_next = buf; line_next < buf + size; )
    {
      const char *line_buf = line_next;
      const char *line_end = memchr (line_buf, eolbyte,
                                     (buf + size) - line_buf);
      if (line_end == NULL)
        line_next = line_end = buf + size;
      else
        line_next = line_end + 1;

      if (start_ptr && start_ptr >= line_end)
        continue;

      result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
      if (result != (size_t) -1)
        return (line_buf - buf) + result;
    }

  return (size_t) -1;
}

あ。こいつか。ところでコメント文を改めてよく読んでみると

  /* With the current implementation, using --ignore-case with a multi-byte
     character set is very inefficient when applied to a large buffer
     containing many matches.  We can avoid much of the wasted effort
     by matching line-by-line.

multi-byte character を使っているときに --ignore-case つけた場合、 many matches を抱えた large buffer を使おうとすると…とかありますね。やっぱりここではトルコ語云々は関係ないんじゃなかろうか。

ということで、applied to a large buffer containing many matches. の詳細について追いかけてみることにしましょう。

またしても続く。

■_

あ、タナカさんがまた dfa 関連のパッチを bug#16966: [PATCH] grep: optimization with the superset of DFA bug#16966: [PATCH] grep: optimization with the superset of DFA

2014年03月08日

■_

教養としてのプログラミング講座 (中公新書ラクレ 489)
この本、見かけたんで買って読んだ(サクサク読めた)けど、解析機関と階差機関ごっちゃになっているような。まあ本文とは関係ない部分なんだけど (そしてそういうところが妙に気になってしまうワタシ)

■_

さて。

2014-01-02	version 2.16v2.16	Jim Meyering	1	-1/+1
2014-01-02	maint: post-release administrivia	Jim Meyering	3	-2/+5
2014-01-07	Port to Solaris 10 /bin/sh.	Paul Eggert	4	-3/+4
2014-01-08	tests: port Solaris 10 /bin/sh patch back to GNU/Linux	Paul Eggert	4	-4/+5
2014-01-10	grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales	Jim Meyering	4	-1/+160
2014-01-11	tests: remove superfluous uses of printf	Pádraig Brady	1	-2/+2
2014-01-17	grep: add undocumented '-X gawk' and '-X posixawk' options	Aharon Robbins	1	-0/+14
2014-01-17	grep: DFA now uses rational ranges in unibyte locales	Paul Eggert	3	-28/+19
2014-01-25	gnulib: update to latest	Jim Meyering	1	-0/+0
2014-01-26	dfa: remove GREP-ifdef'd code in favor of code used by gawk	Norihiro Tanaka	1	-8/+0
2014-01-26	dfasearch: skip kwset optimization when multi-byte+case-insensitive	Norihiro Tanaka	2	-38/+19
2014-01-26	maint: move two local variable declarations	Jim Meyering	1	-4/+2
2014-01-27	maint: remove vestiges of support for long-disabled --mmap option	Jim Meyering	6	-58/+9
2014-02-01	maint: use to_uchar function rather than explicit casts	Jim Meyering	6	-8/+17
2014-02-10	speed up mb-boundary-detection after each preliminary match	Norihiro Tanaka	4	-4/+52
2014-02-11	help: fix a line ending, and use the same word for similar things	Benno Schulenberg	1	-2/+2
2014-02-11	help: remove surplus newline	Benno Schulenberg	1	-1/+1
2014-02-17	maint: ignore configure.lineno	Mike Frysinger	1	-0/+1
2014-02-18	revert "grep: DFA now uses rational ranges in unibyte locales"	Paolo Bonzini	2	-11/+18
2014-02-18	version 2.17v2.17	Jim Meyering	1	-1/+1

2.16 から 2.17 までの履歴を抜き出すとこういう感じ。この中でも今回話題になっているのとは明らかに関係ないだろうというのもあるし、Jim さんの変更とタナカさんの変更が互いに影響及ぼしちゃって逆効果になっているというのもあったので、まずは 2.16 を起点にして、Jim さんの1月10日の変更に絞って見るとしましょうか (一回やったけどね)。

■_ 360

Wirth センセイ作のプログラミング言語に PL360 というのがありまして。 PL360 - Wikipedia, the free encyclopedia 最近その存在を知ったのですが言語仕様を見ているとなかなか面白い。昔懐かし Turbo-C の _AX、_BX みたいな「レジスター変数」を持っていたり、まるで BASIC の on error goto みたいな case 文とか。 Wirth Languages Niklaus Wirth's PL360 and the Burroughs B5500 - Google Groups

2014年03月07日

■_

NHK DVD MIT白熱教室 DVD BOX
MIT白熱教室(再放送)。虹の話の回。やっぱこの回はサイコーですわ。主虹・副虹・過剰虹そしてアレクサンダーの暗帯 : htt'sフィールド

JaSSTソフトウェアテストシンポジウム-JaSST'14 Tokyo 会社の金で行けたという話もあったんですが、開催が金曜土曜で土曜日に会社の金で行くとなると休日出勤扱いになって… と面倒な話になりそうだったので以下略。

なかなか面白いスライドだった。しかし TSUNAMI… Excel Coding Errors Are Destroying World Economies and F# (with Tsunami) Is Here to Stop Them!

アポロ13 [Blu-ray]
BS でやってたので観てしまった。アポロ13 - Wikipedia を読んでたら「ゴードン・クーパー」という名前を見つけて、ゴードン・クーパー - Wikipedia アメリカ初の宇宙飛行士として選ばれた7人「オリジナル・セブン」のひとり。マーキュリー計画やジェミニ計画に参加した。あ、やっぱり「ライト・スタッフ」の最後でロケット乗ってた人だ。

■_

どうまとめるか思案中 grep.git - grep

■_

2014年03月06日

■_

よく、チューリング賞はコンピューター関連のノーベル賞みたいなものだ。とか聞くじゃないですか。これはまた日本でだけいわれてることなんじゃないだろうか誰が言い始めたんだろうかなどと思っていたのですが、 Turing Award - Wikipedia, the free encyclopedia に The Turing Award is recognized as the "highest distinction in Computer science" and "Nobel Prize of computing". という記述が!

■_

「ガンダムを創った男たち」が漫画に　富野氏、安彦氏、大河原氏――アニメの常識覆した3人の「伝記」 (1/2) - ITmedia ニュース
技術者であることと自分にとって大事なことって - (define -ayalog '())
Pythonにサヨナラを - HackerNews翻訳してみた
Excel Coding Errors Are Destroying World Economies and F# (with Tsunami) Is Here to Stop Them!
Matthew Moloney discusses using F# and .NET inside Excel, demonstrating doing big data, cloud computing, using GPGPU and compiling F# Excel UDFs.
How is competitive programming different from real-life programming? | Techoholics

■_

grep.git - grep _UNUSED_PARAMETER_ ってなんだっけ。名前から見当はつくけど grep.git - grep

■_

進捗ダメでしたー

2014年03月05日

■_

Technology Radar January 2014 | ThoughtWorks だいぶ前になってしまいましたが。 Technology Radar January 2014 | ThoughtWorks の assessment のところにいくつか興味深い名前がありますね。しかし、英語版、ポルトガル語版があって Chinese, Spanish and German PDFs coming soon. ですか。ううむ。

はてなブックマーク - ソフトウェア工学 - 東京大学大学院総合文化研究科玉井哲雄の先にあるpdfの、第12章信頼性のところになんたら曲線のお話がちょっとあります。

なんだろこの本。 Amazon.co.jp：これならなっとく C言語へのいざない: 白井豊: 本

■_ malloc した領域を

C++ versus V8 versus luajit versus C benchmark - (hash) tables というのがありまして、その中に C++ versus V8 versus luajit versus C benchmark - (hash) tables C バージョンがあって、 C++ versus V8 versus luajit versus C benchmark - (hash) tables The C version has memory leaks, so, it is not 100% correct and thus the benchmark info isn't precise. というコメントをつけた人が Here's the corrected C version: というのを書いたら Freeing memory before end of program is a waste of time. という反応があってそれに it is nonsense (no matter that in some scenarios it might work without a severe impact). It is praising the ignorance which good programmers avoid. と返してたりしてこういうのは洋の東西問わないのだなあと。

■_

なぜかこのタイミングで reddit に。リリース直後にもスレッドあったような気がするけど。 Version 2.17 of the GNU grep utility is out. "This release is notable for its performance improvements: we don't often see a 10x speed-up in a tool like grep" : programming

■_

HN Search powered by Algolia certificate verification vulnerability in all GnuTLS versions CVE-2014-0092 : netsec existential type crisis : The Story of the GnuTLS Bug

The myth of the fall

2014年03月04日

■_

しらなかったー渋谷金王八幡宮神社 JR渋谷駅から東へ首都高速線沿いに徒歩で数分。右に入る。暦学者、渋川春海（安井算哲）を描いた小説「天地明察」の冒頭から出でくる神社。

金王八幡宮公式ホームページ

■_ Which programming language should you learn?

Which programming language should you learn? | Py Skool これ自体はよくある題材ですが

Which programming language should you learn? | Py Skool

(ざっくり略)

How do you choose a programming language?

So the question to ask yourself is, what do you want to accomplish?

1. Want to write low level code for an embedded system? Use C, maybe C++. Not assembly. No matter how cool you may have heard it is.

2. Want to write enterprise level code (that solves business problems)? Remember, the people who slap and abuse
you, force you to sit in a chair for 8 hours, and then throw you a few pieces of silver? Well, they are the
ones who decide this. So shut up and go back to work.

3. Want to write games? Which games?

    Games that run on your PC or on consoles like XBox- C++, C#
    Online games – Flash (try to avoid this), javascript. Maybe a scripting language like Python or Ruby.

4. Web applications? There is  a group of languages you may have to learn. Javascript, PHP, Python, Ruby, Perl
(though not all of them).

If in doubt, choose Python. It is simple to learn, and easy to read. It is a well supported language, with
plenty of tutorials and books.

· © 2014 Py Skool · Designed by Themes & Co ·

low level code でも Use C, maybe C++. Not assembly. と。

■_ GNU grep

さて落としどころをどこに持って行こう。

まー、後二回か三回くらいで締めたいなーとは思っているんですが、そうするとそこでなにをどのように書いたものかという。 GNU bug reports: package grep 2.16からの変更追いかける…かなあ。

ところで #16893 - [PATCH] Avoid matching line-by-line for case-insensitive with grep - GNU bug report logs によると

#16893 - [PATCH] Avoid matching line-by-line for case-insensitive with grep - GNU bug report logs


Now grep and awk matchers doesn't waste buffer in case-sensisitive matching.
So I think that we can avoid line-by-line matching for them.

It enable to speed up case-sensitive matching with grep or awk matcher
without trivial_case_ignore as fast as when with it.

In bug#16232:
> The following times 2.16, 2.17 and 2.17+patch two ways:
> 
> $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
> $ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
> /p/p/grep-2.$i/bin/grep -i foobar k; done
> 16
>        15.96 real        14.57 user         0.12 sys
> 17
>         1.13 real         1.07 user         0.06 sys
> 18
>         1.96 real         1.89 user         0.06 sys
> 
> The above search takes more than 70% longer with the proposed patch.

Therefore, I think 30% slow-down is caused by the line-by-line matching for them.

一行ごとに処理(読み込みではないのは以前にも書いた通り)することによる速度低下がどのくらいなのかここから見当つきますね。誰だよこれで10倍速とか吹聴してたの…

■_

2014年03月03日

■_

ラーメン+半チャーハンのセットってあるじゃないですか。これ、チャーハン+ハーフサイズのラーメンのセットとかないですかね。ないかー。

東横線某駅にホームドアが設置されてた。稼働はまだだけど。

■_

歌舞伎町で流れる軍事介入ニュースとバーチャルアイドルの映像。 pic.twitter.com/r2ovkAqgS9
— けーの (@kno2502) 2014, 3月 2

歌舞伎町はしばらくそばにも行ってないけどこんなのがついたのかー

■_ 今日のGNU grep

ウオッチしてどうする○|￣|＿

trivial_ignore_case() が復活してる grep.git - grep grep.git - grep

ちょっと落ち着いてからのがいいかなー bug-grep (date)

■_

Pocket に貯めてる記事の数が…

2014年03月02日

■_

お、なんかこう言う話が進んでたのかー「歴史群像シリーズ」約500タイトルの電子書籍化を開始第1弾は2月20日に「黒田官兵衛」関連5タイトルをリリース | 株式会社ブックビヨンドで、第一弾がこれと。黒田官兵衛特集（歴史群像デジタルアーカイブス） | 学研電子ストアWeb

歴史群像―学研デジタル歴史館－｢信長の独断｣　by大野信長デーモン・ヒル結構好きだったんだよなあ…

やる夫と学ぶホビーパソコンの歴史　第22話: 11289 Bytes free - タイニーPのブログ

■_

BSD では ignore-case な検索のときに今回話題になったようなやり方を昔からしていたとかいう話を伝え聞きまして。しかし、GNU regex (昔のね。0.12とか) でも translate テーブル使ってごにょごにょしてたよなあ…と思いつつ昔の GNU grep どうだったか確かめようとしたら 2.0からしかなかった○|￣|＿ Index of /pub/GNU/grep gawkは 1.01 ってのがあるんですけどねえ(日付はともかく) Index of /pub/GNU/gawk

さて。 #16232 - [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales - GNU bug report logs のやり取りを追いかけてみよう。問題の十倍高速化云々の話題のやり取りがあるところ。メッセージの番号が所々抜けてるのはなんなんだろう(先頭が#5だし)

#16232 - [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales - GNU bug report logs

GNU bug report logs - #16232
[PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Reported by: Jim Meyering

Date: Mon, 23 Dec 2013 22:40:02 UTC

To reply to this bug, email your comments to 16232 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering
To: bug-grep <at> gnu.org
Subject: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 14:39:14 -0800

FYI, here is a quick and clean/safe performance improvement for grep -i.
I expect to push this commit right after the upcoming bug-fix release.
Currently, this optimization is enabled when the search string is
ASCII and contains neither of '\' (backslash) nor '['.  I expect to
eliminate the latter two constraints in a follow-on commit including
tests to exercise all of the corner cases.

Happy holidays,
Jim

どうやらここが始まりのようですね。以下適当に、引用部分などを削ります。

Message #8 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:52:26 -0700

On 12/23/2013 03:39 PM, Jim Meyering wrote:
> +
> +  /* Worst case is that every byte of keys will be alpha,
> +     so every byte B will map to the sequence of 4 bytes [Bb].  */

Umm, is this always true?  Consider the UTF-8 Turkish locale, where
single-byte i has a multi-byte uppercase (and conversely the single-byte
I has a multi-byte lowercase) - that is, 'i' and 'I' are not case pairs.

> +      else
> +        {
> +          *p++ = '[';
> +          *p++ = *keys;
> +          *p++ = islower (*keys) ? toupper (*keys) : tolower (*keys);

This performs the ASCII-only toupper/tolower, rather than the proper
locale-sensitive case conversion, which probably makes this patch
misbehave for LC_ALL=tr_TR.UTF-8.

> 
> +  /* As currently implemented, case-insensitive matching is expensive in
> +     multi-byte locales because of a few outlier locales in which some
> +     characters change size when converted to upper or lower case.  To
> +     accommodate those, we revert to searching the input one line at a
> +     time, rather than using the much more efficient buffer search.
> +     However, if we have a plain ascii search string, /foo/, we can
> +     convert it to an equivalent case-insensitive /[fF][oO][oO]/, and thus
> +     avoid the expensive read-and-process-a-line-at-a-time requirement.
> +     Optimize-away the "-i" option, when possible, converting each
> +     candidate alpha, C, in the regexp to [Cc].  */

In other words, this comment describes the very flaw of the outlier
tr_TR locale that you are now violating with your optimization.

Without your patch, this is correct behavior:

$ echo i | LC_ALL=tr_TR.UTF-8 grep -i I
$ echo i | LC_ALL=tr_TR.UTF-8 grep -i i
i

ここで UTF-8 Turkish locale の話が出たと。

Message #11 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering
To: Eric Blake
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:12:26 -0800

On Mon, Dec 23, 2013 at 2:52 PM, Eric Blake <eblake <at> redhat.com> wrote:

>> +  /* Worst case is that every byte of keys will be alpha,
>> +     so every byte B will map to the sequence of 4 bytes [Bb].  */
>
> Umm, is this always true?  Consider the UTF-8 Turkish locale, where

Hi Eric,

Thanks for the review.
Did you miss the "isascii" check in the new trivial_case_convert function?
If you can describe circumstances in which the new patch malfunctions,
please do,
but everything you wrote seems to rely on a false assumption.
E.g., your turkish-I example works fine with my patch.

Message #14 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake
To: Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 16:30:19 -0700

On 12/23/2013 04:12 PM, Jim Meyering wrote:
> Did you miss the "isascii" check in the new trivial_case_convert function?

No.  But even with that check in place:

> If you can describe circumstances in which the new patch malfunctions,
> please do,
> but everything you wrote seems to rely on a false assumption.

No, it's a quite real complaint - your patch is broken for tr_TR.

> E.g., your turkish-I example works fine with my patch.

isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
tr_TR locale.  Rather, the conversion must be to '[iİ]'; similarly, 'I'
would be translated to '[Iı]'.  Neither of those conversions fit in 4
bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
characters).

Need help easily finding those characters on a non-Turkish keyboard?  I
used:
$ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'

At any rate, prior to your patch, lower dotless i in the buffer gives an
insensitive match to upper dotless I in the pattern:

$ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
ı

After your patch:

$ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
no match

Oops, you failed to match lower dotless i insensitively against upper
dotless I, because upper dotless I is ascii, but you incorrectly
converted it into the wrong pattern.

んーと、#5で投稿したパッチに対して、それじゃ UTF-8 Turkish locale でおかしくなるよという指摘をしてるわけですよね。でも基本的なロジックは2.17で話題になったアレと同じなわけで、大文字化/小文字化で長さが変わる云々はやっぱり高速化とは関係ないって話なんじゃあ…

Message #17 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering
To: Eric Blake
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:54:27 -0800

Thanks for dotting those 'i's.  While there is no risk of buffer
overrun, there would definitely be a problem with the tr_TR locale.
I will resolve it by removing the isascii check and performing
multibyte case conversion to form each [cC] pair.  Of course,
that will mean removing the "4 * byte-length-of-search-string"
buffer size limitation.

I will also add tests based on your examples.

ここで Jim さん multibyte case conversion を入れました。と。

こっからざっくりカット。多重の引用部がたくさんあるのでそっち生かした方が楽だ :)


Message #41 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 16232 <16232 <at> debbugs.gnu.org>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Sat, 11 Jan 2014 09:56:53 -0800

On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
>> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <jim <at> meyering.net> wrote:
>>>>> I wonder might this faster path be restricted to a safer but very common input subset of:
>>>>>
>>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>>
>>>> That sounds like a good approach.
>>>> Now I need another test case, to demonstrate that the current code can
>>>> cause trouble.
>>>
>>> Hmm... after thinking about this for a while and actually trying to
>>> break the current code (did not find a way to demonstrate a regression),
>>> I have concluded that the current approach is no worse than the prior
>>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>>
>>> That's not to say that it's perfect, of course.
>>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>>
>>>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf
>>> '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> but this does not, yet probably should:
>>>
>>>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> Can you see a way to demonstrate a regression?
>>
>> Oh right, it doesn't handle these cases already.
>> Fair enough I don't see a regression then.
>
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
>   "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>    the regular (U+03C2) and final (U+03C3) lowercase sigma."
>
> One can see that sed handles this:
>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>   ςσΣΣ
>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>   Σσ
>
> Though I was surprised the grep (2.14) didn't match any combo of these
>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
>
> Not a regression of course.

Thank you for the reference and the fine examples.
I'll add the latter as a known-failing test.


Message #47 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady 
To: Jim Meyering
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Sun, 12 Jan 2014 12:56:23 +0000

On 01/12/2014 04:36 AM, Jim Meyering wrote:
> On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <P <at> draigbrady.com> wrote:
>> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> ...
>> This is also a good summary of stuff to consider with case:
>> http://www.unicode.org/faq/casemap_charprop.html
>>
>> So picking another case situation from there:
>>   "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>>    the regular (U+03C2) and final (U+03C3) lowercase sigma."
>>
>> One can see that sed handles this:
>>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>>   ςσΣΣ
>>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>>   Σσ
>>
>> Though I was surprised the grep (2.14) didn't match any combo of these
>>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
> 
> Actually, if you quote the argument to the latter printf, two of those
> do match, both with -F and without:
> 
> $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
> ςσ
> $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
> $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"
> Σ

Oops right.
So that's still no regression with the new scheme
since grep is 1:1 here for Σ and σ.

thanks,
Pádraig.

ここでタナカさん登場。

Message #50 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka
To: Padraig Brady
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 23:22:08 +0900

Hi,

Slow down may be caused by the patch, because MBCSET is processed by not
DFA engine but regexp engine.

I tested performance on grep-2.17 and the version which the patch is reverted.
Latter is 100x faster.

yes $(printf '%078dm' 0)|head -10000 > in

(略)

Message #56 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering
To: Norihiro Tanaka
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 11:17:16 -0800

On Wed, Feb 19, 2014 at 6:22 AM, Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> for i in $(seq 10); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done

Wow.  You're right.  With the attached patch, I see a speedup of more
than 130x in this case: (fyi, the "time" output is slightly different,
because I have installed GNU time)

grep-2.17$ for i in $(seq 5); do env LC_ALL=ja_JP.eucJP time grep -i n in; done
        2.78 real         2.78 user         0.00 sys
        2.73 real         2.73 user         0.00 sys
        2.75 real         2.75 user         0.00 sys
        2.73 real         2.73 user         0.00 sys
        2.74 real         2.74 user         0.00 sys

2.17+patch$ for i in $(seq 5); do env LC_ALL=ja_JP.eucJP time src/grep -i n in; done
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys
        0.02 real         0.02 user         0.00 sys

I haven't investigated they "why" yet, but expect that I will make
grep-2.18 with just this one performance-improving patch.

Thank you, Norihiro,

Jim

200倍時間がかかるようになってたというのはここから出た数字ですね。んで、2.18すぐ出すよ。と。

そして問題の核心へ

Message #68 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering
To: Norihiro Tanaka
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 19:44:59 -0800

Hmm... it's not as clear-cut as I first thought.
(I built 2.17+ the above patch and put it in a directory named grep-2.18)

The following times 2.16, 2.17 and 2.17+patch two ways:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time /p/p/grep-2.$i/bin/grep -i foobar k; done
16
       15.96 real        14.57 user         0.12 sys
17
        1.13 real         1.07 user         0.06 sys
18
        1.96 real         1.89 user         0.06 sys

The above search takes more than 70% longer with the proposed patch.

Contrast that with performance in the non-UTF8 ja_JP.eucJP locale:

$ yes $(printf '%078dm' 0)|head -10000 > in
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i n in; done
16
        0.03 real         0.02 user         0.00 sys
17
        2.98 real         2.96 user         0.00 sys
18
        0.02 real         0.02 user         0.00 sys

Using the jjj+foobar example, but with only 100k lines, we see there
was a 200x performance regression going from grep-2.16 to 2.17:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -100000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i foobar k; done
16
        0.15 real         0.14 user         0.00 sys
17
       27.74 real        27.72 user         0.01 sys
18
        0.11 real         0.11 user         0.00 sys

Obviously, I want to retain all of 2.17's performance gain in UTF-8 locales,
while avoiding the 200x penalty in multi-byte non-UTF8 locales like ja_JP.eucJP.
So I have prepared a better patch.
With the two attached commits (on top of 2.17), I get these timings,
i.e., the same 200x improvement with ja_JP.eucJP, and no regression with en_US.UTF8)

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=ja_JP.eucJP time /p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.14 real         0.14 user         0.00 sys
17:        27.97 real        27.95 user         0.01 sys
18:         0.12 real         0.12 user         0.00 sys

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=en_US.UTF-8 time /p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.13 real         0.12 user         0.00 sys
17:         0.01 real         0.01 user         0.00 sys
18:         0.01 real         0.01 user         0.00 sys

Message #71 received at 16232 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro TANAKA
To: Jim Meyering
Cc: 16232 <16232 <at> debbugs.gnu.org>, Padraig Brady <P <at> draigbrady.com>
Subject: Re: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Thu, 20 Feb 2014 22:39:18 +0900

Hi Jim,

Your patch is probably right.

However, I think that the true cause for 100x slow is that DFA engine is
slower than regex engine for case-insensitive matching on a non-UTF-8
locle.

On a multibyte locale, for case-insensitive "a" grep prefers DFA engine,
but for character class "[Aa]" prefers regex engine.

Norihiro

このあとも幾つかやり取りがありますが、速くなったり一部の環境でやたらと遅くなったりはこう言う話 (On a multibyte locale, for case-insensitive "a" grep prefers DFA engine, but for character class "[Aa]" prefers regex engine.) であったと。 DFA engine でない方のときに遅くなるというのは過去の経験からしてまあ納得なんだけど、きっちり追いかけてはいないので分からない部分もまだ残っていたり。

■_

他のも見てみるか。 GNU bug reports: package grep

■_

さて、grep の話はどこまで追いかけたものやら…w

でもまああの「間違い」は訂正されることなく定説化するんだろうねえ。

2014年03月01日

■_

読了。感想等は日を改めて。シグナル&ノイズ天才データアナリストの「予測学」

Apple Secure Programming Guide : programming ざっとだけど読んだ。悪くない出来だと思う。あの騒ぎの後ではあまり信頼性がないかもしれないけどw

三日から予約開始と。怪獣酒場

■_

Lessons Learned from Apple's GoToFail Bug あと二つばかりこれと同じ感じのアーティクルがあったんだけどどこいった Pocket にみあたらんぞー

3/2追記 Learning from Apple’s #gotofail Security Bug | Arie van Deursen spootnik.org – Why were there gotos in apple software in the first place ?

■_ GNU grep

さてこの先どう進めたものか悩んでいたのですが (ちょっととっちらかってきたので)、ふと grep.git - grep を見てみると… grep.git - grep grep.git - grep なにこれ。

grep.git - grep

grep: remove trivial_case_ignore
* src/main.c (trivial_case_ignore): Remove. (main): Remove its use; this optimization is no longer needed.

grep.git - grep

grep: don't match line-by-line for case-insensitive with grep and awk
* src/main.c (matcher): Move decl up. (do_execute): With the grep or awk matchers, no need to match line by line.

確かにタナカさんのやってた高速化と、Jim さんのやってた trivial_case_ignore は両方入れる必要はないというか逆効果になっちゃうパターンがあるとは思ったんだけどなくすか trivial_case_ignore。

んーと、このレポートかな #16912 - [PATCH] no longer use CSET for non-UTF8 locale in DFA engine - GNU bug report logs

#16912 - [PATCH] no longer use CSET for non-UTF8 locale in DFA engine - GNU bug report logs

I have overlooked the important thing about optimization by trivial_case_ignore. After optimization by
trivial_case_ignore, kwset engine can be used yet. However, if remove trivial_case_ignore, it's never used
longer because kwsmusts does nothing when MB_CUR_MAX > 1 && match_icase.

The patch reverts removal of trivial_case_ignore and fixes 200x slower for non-UTF8 locales with another
approach. It always prefers CSET to replacement to OR and no longer use CSET for non-UTF8 locales in DFA engine.

It can also optimize by trivial_case_ignore and enables to speed-up >20x for non-UTF8 locales. (I tested it
with euc-jp)

Norihiro

ということで今日はお休み。

■_