Q:

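What is the Python way to do a \G anchored parsing loop?

Below is a Perl function I wrote many years ago. It is a smart tokenizer that recognizes some cases where things have been stuck together that perhaps should not be. For example, given the input on the left, it splits the string as shown on the right: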

abc123  -> abc|123
abcABC  -> abc|ABC
ABC123  -> ABC|123
123abc  -> 123|abc
123ABC  -> 123|ABC
AbcDef  -> Abc|Def    (e.g. CamelCase)
ABCDef  -> ABC|Def    
1stabc  -> 1st|abc    (recognize valid ordinals)
1ndabc  -> 1|ndabc    (but not invalid ordinals)
11thabc -> 11th|abc   (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc

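I am now doing some machine learning experiments, and I would like to run some of them with this tokenizer. But first, I need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, something I hear does not exist in Python. I have tried googling how this is done in Python, but I am not sure exactly what to search for, so I am having trouble finding an answer.

How would you write this function in Python?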

sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters, 
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;

    while ( $value ne '' && $value =~ m/
        \G                # start where previous left off
        ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character 
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1(?i:st))         # first
        |   (?<=[^1][1](?i:st))    # first but not 11th
        |   (?<=^2(?i:nd))         # second
        |   (?<=[^1]2(?i:nd))      # second but not 12th
        |   (?<=^3(?i:rd))         # third
        |   (?<=[^1]3(?i:rd))      # third but not 13th
        |   (?<=1[123](?i:th))     # 11th - 13th
        |   (?<=[04-9](?i:th))     # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not second
        |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not third
        |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}

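I did try using re.split() with a variation of the above. However, split() refuses to split on a zero-width match (an ability that should be available if one really knows what one is doing).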
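For reference, here is a minimal sketch of splitting on a zero-width boundary. It assumes Python 3.7 or later, where re.split() was changed to accept patterns that can match an empty string. The names BOUNDARY and split_on_boundaries are hypothetical, and the pattern is a reduced subset of the rules above with no ordinal handling, not the full tokenizer:

```python
import re

# Zero-width boundaries only: lower->upper, letter->digit, digit->letter.
# Reduced, illustrative subset of the Perl rules (no ordinal handling).
BOUNDARY = re.compile(
    r'(?<=[a-z])(?=[A-Z])'
    r'|(?<=[a-zA-Z])(?=[0-9])'
    r'|(?<=[0-9])(?=[a-zA-Z])'
)

def split_on_boundaries(text):
    # Assumption: Python 3.7+; earlier versions do not split on empty matches.
    return BOUNDARY.split(text)

print(split_on_boundaries('abc123'))
print(split_on_boundaries('AbcDef'))
```

This only covers boundaries that can be described from the characters on either side; it does not address the general \G-style loop.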

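I did come up with a solution to this specific problem, but not to the general problem of "how do I use \G based parsing" - I have some sample code that runs regexes inside loops anchored with \G, and then [...]. So I am still looking for an answer.

That said, here is my final working code that translates the above into Python: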

import re

IsA                 = lambda s: '['  + s + ']'
IsNotA              = lambda s: '[^' + s + ']'

Upper               = IsA( 'A-Z' )
Lower               = IsA( 'a-z' )
Letter              = IsA( 'a-zA-Z' )
Digit               = IsA( '0-9' )
AlphaNumeric        = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric     = IsNotA( 'a-zA-Z0-9' ) 

EndOfString         = '$'
OR                  = '|'

ZeroOrMore          = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore           = lambda s: s + '+'
OneOrMoreNonGreedy  = lambda s: s + '+?'

StartsWith          = lambda s: '^' + s
Capture             = lambda s: '('    + s + ')'
PreceededBy         = lambda s: '(?<=' + s + ')'
FollowedBy          = lambda s: '(?='  + s + ')'
NotFollowedBy       = lambda s: '(?!'  + s + ')'
StopWhen            = lambda s: s
CaseInsensitive     = lambda s: '(?i:' + s + ')'

ST                  = '(?:st|ST)'
ND                  = '(?:nd|ND)'
RD                  = '(?:rd|RD)'
TH                  = '(?:th|TH)'

def OneOf( *args ):
  return '(?:' + '|'.join( args ) + ')'

pattern = '(.+?)' + \
  OneOf( 
    # ABC | !!! - break at whitespace or non-alpha-numeric boundary
    PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
    PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

    # ABC | Abc - break at what looks like the start of a word or sentence
    FollowedBy( Upper + Lower ),

    # abc | ABC - break when a lower-case letter is followed by an upper case
    PreceededBy( Lower )  + FollowedBy( Upper ),

    # abc | 123 - break between words and digits
    PreceededBy( Letter ) + FollowedBy( Digit ),

    # 1st | oak - recognize when the string starts with an ordinal
    PreceededBy( StartsWith( '1' + ST ) ),
    PreceededBy( StartsWith( '2' + ND ) ),
    PreceededBy( StartsWith( '3' + RD ) ),

    # 1st | abc - contains an ordinal
    PreceededBy( IsNotA( '1' ) + '1' + ST ),
    PreceededBy( IsNotA( '1' ) + '2' + ND ),
    PreceededBy( IsNotA( '1' ) + '3' + RD ),
    PreceededBy( '1' + IsA( '123' )  + TH ),
    PreceededBy( IsA( '04-9' )       + TH ),

    # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
    PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( '1' + IsA( '123' ) )  + FollowedBy( Letter ) + NotFollowedBy( TH ),
    PreceededBy( IsA( '04-9' ) )       + FollowedBy( Letter ) + NotFollowedBy( TH ),

    # abcde | $ - end of the string
    FollowedBy( EndOfString )
  )

matcher = re.compile( pattern )

def tokenize( s ):
  return matcher.findall( s )
A:

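Emulating \G at the beginning of a regex with re.RegexObject.match

You can emulate the effect of \G at the beginning of a regex with the re module by keeping track of the starting position yourself and passing it to re.RegexObject.match (documented as re.Pattern.match in current Python 3), which forces the match to begin at the position given in pos.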

def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although index != m.end() check zero-length match, it's more of
    # a guard against accidental infinite loop.
    # Don't expect a regex which can match empty string to work.
    # See Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o
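A self-contained sketch of the same loop, to show that pos behaves like \G rather than like slicing the string: lookbehinds (and ^) still see the text before pos. The pattern and the names TOKEN and scan are hypothetical, chosen only for illustration:

```python
import re

# Hypothetical pattern: a run of digits, but only directly after a lowercase
# letter, or else a run of lowercase letters.
TOKEN = re.compile(r'(?<=[a-z])[0-9]+|[a-z]+')

def scan(text):
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)       # anchored at pos, like \G
        if m is None or m.end() == pos:  # no match, or zero-length match
            break
        tokens.append(m.group())
        pos = m.end()
    return tokens

print(scan('abc123'))
# Slicing loses the context the lookbehind needs, so this prints None:
print(TOKEN.match('abc123'[3:]))
```

This matters for the tokenizer in the question, whose ordinal rules rely on lookbehinds and ^ referring to the original string.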

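Caveat

One caveat of this approach is that it does not work well with a regex that can match the empty string in its main match, since Python has no facility to force the regex engine to retry a match while forbidding a zero-length match.

For example, re.findall(r'(.??)', 'abc') returns an array of 4 empty strings ['', '', '', ''], whereas in PCRE you can find 7 matches ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th, and 6th matches start at the same indices as the 1st, 3rd, and 5th matches respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents an empty-string match. (This describes Python before 3.7; since 3.7, findall() and finditer() allow a non-empty match to start just after a previous empty match. The manual match() loop above still has no such retry.)

I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code could not have worked.

Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) to (.+?), as done in the question, avoids this problem, though you may want to use the re.S flag so that . also matches newlines.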
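The caveat can be seen in a small sketch. The helper anchored_tokens and both patterns are hypothetical and not part of the original code. A pattern that can match the empty string stops the loop at the first empty match, while a pattern that always consumes at least one character runs to the end:

```python
import re

def anchored_tokens(pattern, text):
    """Collect successive matches, each anchored where the previous one ended."""
    pos, out = 0, []
    while True:
        m = pattern.match(text, pos)
        # Stop on failure, and also on a zero-length match, which would
        # otherwise repeat forever at the same position.
        if m is None or m.end() == pos:
            break
        out.append(m.group())
        pos = m.end()
    return out

can_be_empty = re.compile(r'[0-9]*')         # matches '' in front of a letter
never_empty = re.compile(r'[0-9]+|[^0-9]+')  # always consumes a character

print(anchored_tokens(can_be_empty, 'ab12'))
print(anchored_tokens(never_empty, 'ab12'))
```

With can_be_empty the very first match is empty, so nothing is collected; never_empty yields both tokens.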

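Other comments on the regex

Since the case-insensitive flag in Python affects the whole pattern, the case-insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if that is part of your requirement. (Python 3.6 and later also accept scoped inline flags such as (?i:st) directly.)

Since Python supports free-spacing mode with the re.X flag, you can write your regex much as you did in the Perl code: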
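A short sketch of how the rewrites differ, using the hypothetical names strict, loose, and scoped. The scoped form assumes Python 3.6 or later:

```python
import re

strict = re.compile(r'1(?:st|ST)')  # suffix all lower-case or all upper-case
loose = re.compile(r'1[sS][tT]')    # any mix of case, like Perl's (?i:st)
scoped = re.compile(r'1(?i:st)')    # scoped inline flag, Python 3.6+ only

for text in ('1st', '1ST', '1St'):
    print(text,
          bool(strict.fullmatch(text)),
          bool(loose.fullmatch(text)),
          bool(scoped.fullmatch(text)))
```

Only the mixed-case input separates the variants: strict rejects '1St', while loose and scoped accept it.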

matcher = re.compile(r'''
    (.+?)
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character 
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not second
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not third
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)
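To try the free-spacing style end to end without the full rule set, here is a reduced sketch. The pattern below is a hypothetical simplification with no ordinal rules, and re.S is added as suggested earlier. It is driven by the same pos-based loop:

```python
import re

# Reduced, illustrative pattern: NOT the full tokenizer (no ordinal rules).
matcher = re.compile(r'''
    (.+?)                         # lazily take at least one character
    (?:                           # then require a token boundary
        (?<=[a-z])(?=[A-Z])       # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9])    # letter followed by digit
    |   (?<=[0-9])(?=[a-zA-Z])    # digit followed by letter
    |   (?=$)                     # end of string
    )
''', re.X | re.S)

def tokenize(text):
    pos, tokens = 0, []
    m = matcher.match(text, pos)
    while m and m.end() != pos:   # stop on failure or zero-length match
        tokens.append(m.group(1))
        pos = m.end()
        m = matcher.match(text, pos)
    return tokens

print(tokenize('abc123Def'))
```

Because (.+?) always consumes at least one character, the loop cannot stall on an empty match.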
