Discussion:
Regular expression: option match after a greedy/non-greedy match
Viet-Duc Le
2014-09-17 02:23:08 UTC


Greetings from S. Korea!

I am parsing the output of ffmpeg with Perl. In particular, I want to print only the following lines of the output and capture the resolution, e.g. 1280x720.
....

Stream #0:0: Video: h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)
Stream #0:1(jpn): Audio: ac3, 48000 Hz, stereo, fltp, 192 kb/s (default)
Stream #0:2(eng): Subtitle: ass (default)
.....


My code is as follows:
    # INFO is a pipe to ffmpeg
    # The print "$1 $2 $3 $4\n" here is only for debugging.
    while ( <INFO> ) {
        if ( <regular expression> ) {
            print "$1 $2 $3 $4\n";
        }
    }
Desired output:
-> Video 1280 720
Audio
Subtitle

Regarding the <regular expression>:
1. /Stream #\d:\d.*(Video|Audio|Subtitle).*(\d+)x(\d+)/ (greedy)
-> Video 0 720
Q: Why does $2 give 0? As I recall, .* matches backward starting from the end of the string, so the output should be "Video 1280 720".

2. /Stream #\d:\d.*(Video|Audio|Subtitle).*?(\d+)x(\d+)/ (non greedy)
-> Video 1280 720
Q: I can understand this, but again I think (1) should work too.

3. /Stream #\d:\d.*(Video|Audio|Subtitle).*?(?:(\d+)x(\d+))?/ ( non-capturing optional group )
-> Video
Audio
Subtitle
Q: It seems that the resolution part is ignored because it is optional. Without the optional group, the output would contain "Video" only, as in (1) and (2). How can I circumvent this?

4. /Stream #\d:\d.*(Video|Audio|Subtitle).+?(?:(\d+)x(\d+))?.*?$/
-> Video
Audio
Subtitle
Q: I tried to match the text after the resolution, hoping that the resolution would then be captured.

5. /Stream #\d:\d.*(Video|Audio|Subtitle).+?(?:(\d+)x(\d+))?(.*?)$/ ( let's capture the last part)
-> Video h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)
Audio ac3, 48000 Hz, stereo, fltp, 192 kb/s (default)
Subtitle ass (default)
Q: Now $2 and $3 are undef, and the rest of the string went to $4. Again, I am quite puzzled by the output.

Please pardon my long email. I hope someone can point out the flaws in my logic. Here, I can match and print Video/Audio/Subtitle separately.
But I wish for one expression to match them all, one expression to print them.


Best regards,

Viet-Duc
Jing Yu
2014-09-17 03:22:00 UTC
Hi Viet-Duc Le,
Post by Viet-Duc Le
1. /Stream #\d:\d.*(Video|Audio|Subtitle).*(\d+)x(\d+)/ (greedy)
-> Video 0 720
Q: Why does $2 give 0? As I recall, .* matches backward starting from the end of the string, so the output should be "Video 1280 720".
That '0' is the last digit of 1280: the greedy '.*' has consumed everything up to and including '128'. Under the hood, .* first runs to the end of the target string and then backtracks one character at a time until the rest of the regex can match. As soon as the whole regex is satisfied it stops backtracking, even though backing up further would also satisfy the regex. Since (\d+) needs only one digit, the first success leaves just '0' for $2.
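The difference can be reproduced in isolation. This is a minimal sketch, assuming a shortened copy of the Video line from the thread; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Shortened sample line copied from the ffmpeg output quoted in this thread.
my $line = 'Stream #0:0: Video: h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps';

# Greedy: .* grabs everything, then gives back only as much as is needed
# for (\d+)x(\d+) to match, so it keeps "128" and leaves "0x720".
my @greedy = $line =~ /(Video|Audio|Subtitle).*(\d+)x(\d+)/;

# Lazy: .*? grows one character at a time and stops at the first position
# where (\d+)x(\d+) can match, which is the start of "1280x720".
my @lazy = $line =~ /(Video|Audio|Subtitle).*?(\d+)x(\d+)/;

print "greedy: @greedy\n";   # greedy: Video 0 720
print "lazy:   @lazy\n";     # lazy:   Video 1280 720
```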
Post by Viet-Duc Le
2. /Stream #\d:\d.*(Video|Audio|Subtitle).*?(\d+)x(\d+)/ (non greedy)
-> Video 1280 720
Q: I can understand this, but again I think (1) should work too.
3. /Stream #\d:\d.*(Video|Audio|Subtitle).*?(?:(\d+)x(\d+))?/ ( non-capturing optional group )
-> Video
Audio
Subtitle
Q: It seems that the resolution part is ignored because it is optional. Without the optional group, the output would contain "Video" only, as in (1) and (2). How can I circumvent this?
The ?: only makes the outer group non-capturing; the inner (\d+) groups still fill $2 and $3 whenever they take part in the match. The trailing ? tells the group to match optionally. It is equivalent to {0,1}. The big problem is the .*? in the middle. Since the last part is optional, .*? just matches the fewest characters possible, which is none, and the optional group then matches nothing at that position.
Post by Viet-Duc Le
4. /Stream #\d:\d.*(Video|Audio|Subtitle).+?(?:(\d+)x(\d+))?.*?$/
-> Video
Audio
Subtitle
Q: I tried to match things after the resolution, hoping that it will be captured.
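Again, the ?: is not what stops the capture. .+? in the middle is better, but it now matches only ':'. The optional group still finds no resolution at that position, so it matches nothing and the trailing .*?$ takes the rest of the line.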
Post by Viet-Duc Le
5. /Stream #\d:\d.*(Video|Audio|Subtitle).+?(?:(\d+)x(\d+))?(.*?)$/ ( let's capture the last part)
-> Video h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)
Audio ac3, 48000 Hz, stereo, fltp, 192 kb/s (default)
Subtitle ass (default)
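Q: Now $2 and $3 are undef, and the rest of the string went to $4. Again, I am quite puzzled by the output.
The optional group is only tried at the position where .+? stops, which is after a single character. It matches nothing there, the overall match can still succeed, and so the engine never goes back to extend .+?. Everything therefore goes to the (.*?)$.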
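One way to keep the resolution optional and still capture it is to move the lazy .*? inside the optional group, so the engine has to try to find a resolution before it may skip the group. This is only a sketch of that idea, not the regex posted in this thread; the sample lines are copied from the original question and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample lines copied from the ffmpeg output quoted in this thread.
my @lines = (
    'Stream #0:0: Video: h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)',
    'Stream #0:1(jpn): Audio: ac3, 48000 Hz, stereo, fltp, 192 kb/s (default)',
    'Stream #0:2(eng): Subtitle: ass (default)',
);

my @results;
for my $line (@lines) {
    # The lazy .*? sits inside the optional group: the group as a whole is
    # attempted first, and is skipped only if no WIDTHxHEIGHT follows.
    if ( $line =~ / Stream \s \#\d+:\d+ .*? (Video|Audio|Subtitle):
                    (?: .*? (\d+) x (\d+) )? /x ) {
        # $2 and $3 are undef when the group was skipped, so drop them.
        my $out = join ' ', grep { defined } $1, $2, $3;
        push @results, $out;
        print "$out\n";
    }
}
```

For the three sample lines this prints "Video 1280 720", "Audio" and "Subtitle", without any uninitialized-value warnings.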
Post by Viet-Duc Le
Please pardon my long email. I hope someone can point out the flaws in my logic. Here, I can match and print Video/Audio/Subtitle separately.
But I wish for one expression to match them all, one expression to print them.
In general, it is a better practise to add 'x' to your regex to make it more readable. My regex might not be the best, but it works as expected.

use strict;
use warnings;
use 5.16.0;

while(<DATA>){
    / (Video|Audio|Subtitle) (?: (?:.) +? (\d+x\d+) || (?:.)+ ) /x
        and say $1, $2, $3, $4;
}


__DATA__
Stream #0:0: Video: h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)
Stream #0:1(jpn): Audio: ac3, 48000 Hz, stereo, fltp, 192 kb/s (default)
Stream #0:2(eng): Subtitle: ass (default)

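Note that '||' is not a special operator inside a regex: it is two '|' with an empty alternative between them. Alternatives are tried from left to right, so the engine looks at a later alternative only if the earlier ones fail. Here, if the resolution alternative fails, the empty one matches and the line is still accepted. This makes matching the resolution a priority, but not a necessity.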
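The same idea can be written with a single '|' and with only the capture groups that actually exist. This is a hedged variant of the script above, not a replacement for it: it assumes the three sample lines from the thread, uses print so that no feature bundle is needed, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Sample lines copied from the ffmpeg output quoted in this thread.
my @lines = (
    'Stream #0:0: Video: h264 (High), yuv420p, 1280x720, SAR 1:1 DAR 16:9, 23.98 fps, 23.98 tbr, 1k tbn, 47.95 tbc (default)',
    'Stream #0:1(jpn): Audio: ac3, 48000 Hz, stereo, fltp, 192 kb/s (default)',
    'Stream #0:2(eng): Subtitle: ass (default)',
);

my @seen;
for my $line (@lines) {
    # First alternative: lazily skip ahead to a WIDTHxHEIGHT pair.
    # Second alternative: just consume the rest, leaving $2 undefined.
    next unless $line =~ / (Video|Audio|Subtitle) (?: .+? (\d+x\d+) | .+ ) /x;

    # Only $1 and $2 exist in this pattern; test $2 before printing it.
    my $out = defined $2 ? "$1 $2" : $1;
    push @seen, $out;
    print "$out\n";
}
```

Checking defined $2 before use avoids the "Use of uninitialized value" warnings that appear when groups that did not participate in the match are printed.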

Hope this helps.
Jing
Uday Vernekar
2014-09-17 09:08:38 UTC
When I run this script I get the following error:

bash-4.2$ ./regex.pl
feature version v5.16.0 required--this is only version v1.160.0 at ./regex.pl line 4.
BEGIN failed--compilation aborted at ./regex.pl line 4.

But I am using the Perl version shown below.

bash-4.2$ perl -v

This is perl 5, version 16, subversion 3 (v5.16.3) built for i686-linux

Copyright 1987-2012, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
--
*********************************************************
Don't ask them WHY they hurt you,
because all they'll tell you is lies and excuses.
Just know they were wrong, and try to move on.
**********************************************************
Jing Yu
2014-09-17 09:11:43 UTC
Strange... I only used 5.16.0 for the feature 'say'. You can of course omit that line, change 'say' to 'print', and append a "\n" at the end instead.

Cheers,
Jing
Uday Vernekar
2014-09-17 09:14:33 UTC
When I change
use 5.16.0; to use feature ':5.10';

it works, and I get the following output:

bash-4.2$ ./regex.pl
Use of uninitialized value $3 in say at ./regex.pl line 7, <DATA> line 1.
Use of uninitialized value $4 in say at ./regex.pl line 7, <DATA> line 1.
Video1280x720
Use of uninitialized value $2 in say at ./regex.pl line 7, <DATA> line 2.
Use of uninitialized value $3 in say at ./regex.pl line 7, <DATA> line 2.
Use of uninitialized value $4 in say at ./regex.pl line 7, <DATA> line 2.
Audio
Use of uninitialized value $2 in say at ./regex.pl line 7, <DATA> line 3.
Use of uninitialized value $3 in say at ./regex.pl line 7, <DATA> line 3.
Use of uninitialized value $4 in say at ./regex.pl line 7, <DATA> line 3.
Subtitle


How do these two use statements differ?

With use 5.16.0;, running perl regex.pl works.

Why doesn't ./regex.pl work?

It gives the following error:
feature version v5.16.0 required--this is only version v1.160.0 at ./regex.pl line 4.
BEGIN failed--compilation aborted at ./regex.pl line 4.
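
For reference, use feature ':5.10' only enables the 5.10 feature bundle (which includes say), while use 5.16.0 also requires the running perl to be at least that version. In addition, ./regex.pl runs whichever interpreter the script's #! line names, whereas perl regex.pl runs the first perl found on PATH, so the two invocations need not use the same binary. Whether that explains the error above is not established in this thread. A minimal sketch for checking which interpreter actually runs a script, to be tried with both invocations:

```perl
use strict;
use warnings;

# $^X is the path of the perl binary executing this script;
# $^V is its version, printed here in dotted form via %vd.
my $report = sprintf "interpreter: %s\nversion:     %vd\n", $^X, $^V;
print $report;
```

If the two invocations report different paths or versions, the #! line points at a different perl than the one on PATH.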
Viet-Duc Le
2014-09-17 16:11:06 UTC
Dear Jing,

I was confused when I started out with regular expressions. Many thanks for the kind and detailed explanation.
After reading more on perl regex, I think I have a better grasp of the greedy/non-greedy concept now.
Your code also worked well for my task.

Regards,
Viet-Duc
