Python 正则表达式

/regex

在本教程中，您将学习正则表达式（RegEx），并使用 Python 的`re`模块与 RegEx 一起使用（在示例的帮助下）。

正则表达式（RegEx）是定义搜索模式的字符序列。例如，

^a...s$

上面的代码定义了 RegEx 模式。模式为从a到s的任何五个字母的字符串。

使用 RegEx 定义的模式可用于与字符串匹配。

表达式	字符串	是否匹配
`^a...s$`	`abs`	没有匹配
	`alias`	匹配
	`abyss`	匹配
	`Alias`	没有匹配
	`An abacus`	没有匹配

Python 有一个名为re的模块可与 RegEx 一起使用。这是一个例子：

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

在这里，我们使用re.match()函数在test_string中搜索pattern。如果搜索成功，该方法将返回一个匹配对象。如果不是，则返回None。

re模块中定义了其他一些函数，可与 RegEx 一起使用。在探讨之前，让我们学习正则表达式本身。

如果您已经了解 RegEx 的基础知识，请跳至 Python RegEx。

使用 RegEx 指定模式

为了指定正则表达式，使用了元字符。在上面的示例中，^和$是元字符。

元字符

元字符是 RegEx 引擎以特殊方式解释的字符。以下是元字符列表：

[] . ^ $  + ? {} () \ |

[] - 方括号

方括号指定您要匹配的一组字符。

表达式	字符串	是否匹配
`[abc]`	`a`	1 个匹配
	`ac`	2 个匹配
	`Hey Jude`	没有匹配
	`abc de ca`	5 个匹配

在这里，如果您要匹配的字符串包含a，b或c中的任何一个，则[abc]将匹配。

您也可以使用方括号内的-指定字符范围。

[a-e]与[abcde]相同。
[1-4]与[1234]相同。
[0-39]与[01239]相同。

您可以在方括号的开头使用尖号^符号来补充（反转）字符集。

[^abc]表示除a或b或c之外的任何字符。
[^0-9]表示任何非数字字符。

. - 句号

句点匹配任何单个字符（换行符'\n'除外）。

表达式	字符串	是否匹配
`..`	`a`	没有匹配
	`ac`	1 个匹配
	`acd`	1 个匹配
	`acde`	2 个匹配（包含 4 个字符）

^ - 脱字符

插入符号^用于检查字符串是否以某个字符开头。

表达式	字符串	是否匹配
`^a`	`a`	1 个匹配
`abc`	1 个匹配
`bac`	没有匹配
`^ab`	`abc`	1 个匹配
`acb`	没有匹配（以`a`开头，但之后没有`b`）

$ - 美元

美元符号$用于检查字符串是否以某个字符结束。

表达式	字符串	是否匹配
`a$`	`a`	1 个匹配
`formula`	1 个匹配
`cab`	没有匹配

* - 星号

星形符号*匹配模式的零次或多次出现。

表达式	字符串	是否匹配
`ma*n`	`mn`	1 个匹配
`man`	1 个匹配
`maaan`	1 个匹配
`main`	没有匹配（`a`之后没有`n`）
`woman`	1 个匹配

+- 加号

加号+匹配左侧的模式的一个或多个出现。

表达式	字符串	是否匹配
`ma+n`	`mn`	没有匹配（没有`a`字符）
`man`	1 个匹配
`maaan`	1 个匹配
`main`	没有匹配（`a`后跟`n`）
`woman`	1 个匹配

? - 问号

问号符号?匹配模式的零或一次出现。

表达式	字符串	是否匹配
`ma?n`	`mn`	1 个匹配
`man`	1 个匹配
`maaan`	没有匹配（一个以上`a`字符）
`main`	没有匹配（`a`之后没有`n`）
`woman`	1 个匹配

{} - 大括号

考虑以下代码：{n,m}。这意味着至少n，并且最多m个重复的模式留给它。

表达式	字符串	是否匹配
`a{2,3}`	`abc dat`	没有匹配
`abc daat`	1 个匹配（在`d<u>aa</u>t`）
`aabc daaat`	2 个匹配（在`<u>aa</u>bc`和`d<u>aaa</u>t`）
`aabc daaaat`	2 个匹配（在`<u>aa</u>bc`和`d<u>aaa</u>at`）

让我们再尝试一个示例。此 RegEx [0-9]{2, 4}匹配至少 2 位但不超过 4 位

表达式	字符串	是否匹配
`[0-9]{2,4}`	`ab123csde`	1 个匹配（在`ab<u>123</u>csde`）
`12 and 345673`	3 个匹配（`<u>12</u>`，`<u>3456</u>`，`<u>73</u>`）
`1 and 2`	没有匹配

| - 替代项

竖线|用于交替（or运算符）。

表达式	字符串	是否匹配
`a\|b`	`cde`	没有匹配
`ade`	1 个匹配（在`<u>a</u>de`匹配）
`acdbea`	3 个匹配（在`<u>a</u>cd<u>b</u>e<u>a</u>`）

在此，a|b匹配包含a或b的任何字符串

() - 分组

括号()用于对子模式进行分组。例如，(a|b|c)xz匹配与a或b或c匹配的任何字符串，后跟xz

表达式	字符串	是否匹配
`(a\|b\|c)xz`	`ab xz`	没有匹配
`abxz`	1 个匹配（在`a<u>bxz</u>`匹配）
`axz cabxz`	2 个匹配（在`<u>axz</u>bc ca<u>bxz</u>`）

\ - 反斜线

反冲\用于转义包括所有元字符在内的各种字符。例如，

如果字符串包含$后跟a，则\$a匹配。在这里，$不是由 RegEx 引擎以特殊方式解释的。

如果不确定某个字符是否具有特殊含义，可以在其前面放置\。这样可以确保不对字符进行特殊处理。

特殊序列

特殊序列使常用模式更易于编写。以下是特殊序列的列表：

\A - 如果指定字符在字符串的开头，则匹配。

表达式	字符串	是否匹配
`\Athe`	`the sun`	匹配
`In the sun`	没有匹配

\b - 如果指定字符在单词的开头或结尾，则匹配。

表达式	字符串	是否匹配
`\bfoo`	`football`	匹配
`a football`	匹配
`afootball`	没有匹配
`foo\b`	`the foo`	匹配
`the afoo test`	匹配
`the afootest`	没有匹配

\B - 与\b相反。如果指定字符不是单词开头或结尾，则匹配。

表达式	字符串	是否匹配
`\Bfoo`	`football`	没有匹配
`a football`	没有匹配
`afootball`	匹配
`foo\B`	`the foo`	没有匹配
`the afoo test`	没有匹配
`the afootest`	匹配

\d - 匹配任何十进制数字。相当于[0-9]

表达式	字符串	是否匹配
`\d`	`12abc3`	3 个匹配（在`<u>12</u>abc<u>3</u>`）
`Python`	没有匹配

\D - 匹配任何非十进制数字。相当于[^0-9]

表达式	字符串	是否匹配
`\D`	`1ab34"50`	3 个匹配（在`1<u>ab</u>34<u>"</u>50`）
`1345`	没有匹配

\s - 匹配字符串包含任何空格字符的地方。等效于[ \t\n\r\f\v]。

表达式	字符串	是否匹配
`\s`	`Python RegEx`	1 个匹配
`PythonRegEx`	没有匹配

\S - 匹配字符串包含任何非空白字符的地方。等效于[^ \t\n\r\f\v]。

表达式	字符串	是否匹配
`\S`	`a b`	2 个匹配（在`<u>a</u> <u>b</u>`）
	没有匹配

\w - 匹配任何字母数字字符（数字和字母）。等效于[a-zA-Z0-9_]。顺便说一下，下划线_也被认为是字母数字字符。

表达式	字符串	是否匹配
`\w`	`12&": ;c`	3 个匹配（在`<u>12</u>&": ;<u>c</u>`）
`%"> !`	没有匹配

\W - 匹配任何非字母数字字符。相当于[^a-zA-Z0-9_]

表达式	字符串	是否匹配
`\W`	`1a2%c`	1 个匹配（在`1<u>a</u>2<u>%</u>c`）
`Python`	没有匹配

\Z - 如果指定字符在字符串的末尾，则匹配。

表达式	字符串	是否匹配
`Python\Z`	`I like Python`	1 个匹配
`I like Python Programming`	没有匹配
`Python is fun.`	没有匹配

提示：要构建和测试正则表达式，可以使用 RegEx 测试器工具，例如 regex101 。该工具不仅可以帮助您创建正则表达式，还可以帮助您学习它。

现在，您了解了 RegEx 的基础知识，让我们讨论如何在 Python 代码中使用 RegEx。

Python RegEx

Python 有一个名为re的模块可用于正则表达式。要使用它，我们需要导入模块。

import re

该模块定义了一些可与 RegEx 一起使用的函数和常量。

`re.findall()`

re.findall()方法返回包含所有匹配项的字符串列表。

示例 1：`re.findall()`

 # Program to extract numbers from a string

import re

string = 'hello 12 hi 89\. Howdy 34'
pattern = '\d+'

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

如果找不到该模式，则re.findall()返回一个空列表。

`re.split()`

re.split方法在存在匹配项的地方拆分字符串，并返回发生拆分的字符串列表。

示例 2：`re.split()`

 import re

string = 'Twelve:12 Eighty nine:89.'
pattern = '\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

如果找不到该模式，则re.split()返回一个包含原始字符串的列表。

您可以将maxsplit参数传递给re.split()方法。这是将要发生的最大拆分次数。

 import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = '\d+'

# maxsplit = 1
# split only at the first occurrence
result = re.split(pattern, string, 1) 
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

顺便说一下，maxsplit的默认值为 0；意味着所有可能的分裂。

`re.sub()`

re.sub()的语法为：

re.sub(pattern, replace, string)

该方法返回一个字符串，其中匹配的匹配项被replace变量的内容替换。

示例 3：`re.sub()`

 # Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = '\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

如果找不到该模式，则re.sub()返回原始字符串。

您可以将count作为第四个参数传递给re.sub()方法。如果省略，则结果为 0。这将替换所有出现的事件。

 import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = '\s+'
replace = ''

new_string = re.sub(r'\s+', replace, string, 1) 
print(new_string)

# Output:
# abc12de 23
# f45 6

`re.subn()`

re.subn()与re.sub()相似，但期望它返回 2 个项目的元组，其中包含新字符串和进行的替换数目。

示例 4：`re.subn()`

 # Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = '\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)

`re.search()`

re.search()方法采用两个参数：模式和字符串。该方法查找 RegEx 模式与字符串匹配的第一个位置。

如果搜索成功，则re.search()返回一个匹配对象。如果不是，则返回None。

match = re.search(pattern, str)

示例 5：`re.search()`

 import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search('\APython', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

# Output: pattern found inside the string

在此，match包含一个匹配对象。

匹配对象

您可以使用dir()函数获取匹配对象的方法和属性。

匹配对象的一些常用方法和属性是：

`match.group()`

group()方法返回字符串中匹配的部分。

示例 6：匹配对象

 import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = '(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
  print(match.group())
else:
  print("pattern not found")

# Output: 801 35

在此，match变量包含一个匹配对象。

我们的模式(\d{3}) (\d{2})具有两个子组(\d{3})和(\d{2})。您可以获取这些带括号的子组的字符串的一部分。这是如何做：

>>> match.group(1)
'801'

>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')

>>> match.groups()
('801', '35')

`match.start()`，`match.end()`和`match.span()`

start()函数返回匹配的子字符串的开头的索引。同样，end()返回匹配的子字符串的结束索引。

>>> match.start()
2
>>> match.end()
8

span()函数返回一个包含匹配部分的开始和结束索引的元组。

>>> match.span()
(2, 8)

`match.re`和`match.string`

匹配对象的re属性返回正则表达式对象。同样，string属性返回传递的字符串。

>>> match.re
re.compile('(\\d{3}) (\\d{2})')

>>> match.string
'39801 356, 2102 1111'

我们已经介绍了re模块中定义的所有常用方法。如果您想了解更多信息，请访问 Python 3 re模块。

在 RegEx 之前使用`r`前缀

在正则表达式前使用r或R前缀时，表示原始字符串。例如，'\n'是换行，而r'\n'表示两个字符：反斜杠\后跟n。

反冲\用于转义包括所有元字符在内的各种字符。但是，使用r前缀会使\被视为正常字符。

示例 7：使用`r`前缀的原始字符串

 import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']