[TechSucks] strange sed behaviour

I have an IRC bot, powered by ii, that automatically prints the content of the <title> tag of any URL that is posted on its own, without an explanation of what the URL is.
It does so like this:
wget -o /dev/null -O - "http://www.example.com/" | tr '\n' ' ' | tr -d $'\r' > tmp
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' tmp )
With pages from Spiegel Online this causes problems I can't track down. Here is an example page for which the sed call gives different results depending on the LANG environment variable:
export LANG=C
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title>
export LANG=en_US.utf8
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title> PLUS everything after an Ü character
sed --version
GNU sed version 4.1.5
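One way to narrow this down would be to check whether spiegel.html is actually valid UTF-8. In a UTF-8 locale, GNU sed does not let `.` match a byte sequence that is not a valid character, so a Latin-1 encoded 'Ü' (the single byte 0xDC) would stop `.*` right there. That the page is Latin-1 encoded is only an assumption here, and `is_valid_utf8` is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_valid_utf8 spiegel.html; then
    echo "spiegel.html is valid UTF-8"
else
    echo "spiegel.html is not valid UTF-8 (or could not be read)"
fi
```

If the file turns out not to be valid UTF-8, the different results for LANG=C and LANG=en_US.utf8 would fit that explanation.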
Can someone explain this?
EOF
13:03:10 23.04.2007
flipflip (2007-04-24 08:31:11)
I cannot reproduce your problem. But maybe the following works:
title=`wget -qO- http://blog.crash-override.net/img/spiegel.html | sed 's,.*<title>\(.*\)</title>.*,\1,mi'`; echo $title
Klimafolgen: China fürchtet dramatischen Rückgang der Reisproduktion - Wissenschaft - SPIEGEL ONLINE - Nachrichten
blindcoder (2007-04-24 08:31:53)
I tried it, but with the same result. .* stops matching at 'Ü'.
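
Since the match works with LANG=C, a possible workaround is to force the C locale for the sed call only, so the rest of the bot keeps its UTF-8 environment. This is a minimal sketch, not a confirmed fix: `extract_title` is a hypothetical helper name, and it assumes the trouble comes from bytes that are not valid UTF-8.

```shell
# Hypothetical helper: print the text between <title> and </title> of a file.
# LC_ALL=C overrides LANG and LC_CTYPE for this one sed call, so sed works on
# raw bytes and "." also matches bytes that are not valid UTF-8.
extract_title() {
    tr '\n' ' ' < "$1" | tr -d '\r' |
        LC_ALL=C sed -n 's,^.*<title>\(.*\)</title>.*,\1,p'
}

# Usage: title=$(extract_title spiegel.html)
```

The title comes out in whatever encoding the page used, so it may still need converting (for example with iconv) before it is printed to a UTF-8 channel.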


