
The TechSucks TechBlog - blog.crash-override.net

Why technology sucks, and some just sucks less. The view and opinion of an experienced user.


strange sed behaviour

I run an IRC bot, powered by ii, that automatically prints the content of the <title> tag of any URL that is posted on its own, without an explanation of what it links to.
It does so like this:

wget -o /dev/null -O - "http://www.example.com/" | tr '\n' ' ' | tr -d $'\r' > tmp
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' tmp )


With pages from Spiegel Online this causes problems I can't trace. Here is an example page for which the sed call gives different results depending on the LANG environment variable:
export LANG=C
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title>
export LANG=en_US.utf8
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title> PLUS everything after an Ü character

sed --version
GNU sed version 4.1.5
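
One thing that might help narrow this down (a guess on my part, nothing above confirms it) is to check whether the saved page is valid UTF-8 at all. The sketch below uses iconv for that. The helper name is_valid_utf8 and the sample file are made up for illustration, and the \334 byte stands in for an 'Ü' as ISO-8859-1 would encode it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the bot): succeeds only if the file
# decodes cleanly as UTF-8. iconv exits non-zero on an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Build a tiny stand-in for spiegel.html. \334 is the single byte that
# ISO-8859-1 uses for "Ü"; on its own it is not a valid UTF-8 sequence.
sample=$(mktemp)
printf '<title>Test</title> \334bersicht\n' > "$sample"

if is_valid_utf8 "$sample"; then
    verdict="valid UTF-8"
else
    verdict="not valid UTF-8"
fi
echo "$sample: $verdict"
```

Running the same check against spiegel.html would show whether the file and the en_US.utf8 locale disagree about the encoding.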


Can someone explain this?


EOF

Posted 13:03:10 23.04.2007

flipflip (2007-04-24 08:31:11)

I cannot reproduce your problem. But maybe the following works:

title=`wget -qO- http://blog.crash-override.net/img/spiegel.html | sed 's,.*<title>\(.*\)</title>.*,\1,mi'`; echo $title
Klimafolgen: China fürchtet dramatischen Rückgang der Reisproduktion - Wissenschaft - SPIEGEL ONLINE - Nachrichten

blindcoder (2007-04-24 08:31:53)

I tried it, but with the same result. .* stops matching at 'Ü'.
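
A likely explanation, offered as an assumption rather than something confirmed in this thread: the GNU sed manual documents that in a multibyte locale such as en_US.utf8, regular expressions do not match invalid multibyte sequences. If the page is served as ISO-8859-1, the single byte for 'Ü' is not valid UTF-8, so .* would stop right there. A minimal workaround sketch under that assumption is to run only the sed call in the C locale, where . matches any byte. The function name extract_title and the sample page are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (name is illustrative): extract the <title> text
# with sed running in the C locale, so "." matches any single byte.
extract_title() {
    # Join lines and drop carriage returns, as in the original pipeline.
    tr '\n' ' ' < "$1" | tr -d '\r' |
        LC_ALL=C sed 's,^.*<title>\(.*\)</title>.*,\1,'
}

# Stand-in page: \334 is "Ü" encoded as ISO-8859-1 (assumed encoding).
sample=$(mktemp)
printf '<html><head>\r\n<title>Reis</title>\r\n</head><body>\334bersicht</body></html>\n' > "$sample"

title=$(extract_title "$sample")
echo "$title"   # prints: Reis
```

LC_ALL is used instead of LANG because LC_ALL takes precedence over the other locale variables, so the override works even when the bot's environment already sets LC_ALL or LC_CTYPE. The extracted title is then a raw byte string; converting it with iconv before printing to a UTF-8 channel would be a separate step.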
