
Python regex: extracting items from an org-mode file

2024-10-01 · 33 · sharpest

I have the following org-mode syntax:

** Hardware [0/1] 
 - [ ] adapt a programmable motor to a tripod to be used for panning  
** Reading - Technology [1/6] 
 - [X] Introduction to Networking - Charles Severance 
 - [ ] A Tour of C++ - Bjarne Stroustrup 
 - [ ] C++ How to Program - Paul Deitel 
 - [X] Computer Systems - Randal Bryant 
 - [ ] The C programming language - Brian Kernighan 
 - [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
 - [ ] Patrick McKeown - The Oxygen Advantage 
 - [X] Total Knee Health - Martin Koban 
 - [X] Supple Leopard - Kelly Starrett 
 - [X] Convict Conditioning 1 and 2   

I want to extract the items under a given heading, for example:

 getitems "Hardware" 

I should get:

  - [ ] adapt a programmable motor to a tripod to be used for panning   

If I ask for "Reading - Health", I should get:

 - [ ] Patrick McKeown - The Oxygen Advantage 
 - [X] Total Knee Health - Martin Koban 
 - [X] Supple Leopard - Kelly Starrett 
 - [X] Convict Conditioning 1 and 2  

I am using the following pattern:

   pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL) 

The output when requesting "Reading - Technology" is:

  - [X] Introduction to Networking - Charles Severance 
  - [ ] A Tour of C++ - Bjarne Stroustrup 
  - [ ] C++ How to Program - Paul Deitel 
  - [X] Computer Systems - Randal Bryant 
  - [ ] The C programming language - Brian Kernighan 
  - [ ] Beginning Linux Programming -Matthew and Stones 
   ** Reading - Health [3/4] 
  - [ ] Patrick McKeown - The Oxygen Advantage 
  - [X] Total Knee Health - Martin Koban 
  - [X] Supple Leopard - Kelly Starrett 
  - [X] Convict Conditioning 1 and 2   

I also tried:

   pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL) 

This second pattern works for every heading except the last one.

The output when requesting "Reading - Health" is:

 - [ ] Patrick McKeown - The Oxygen Advantage 
 - [X] Total Knee Health - Martin Koban 
 - [X] Supple Leopard - Kelly Starrett 

As you can see, it does not match the last line.

I am using Python 2.7 and findall.
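The overrun seen with the first pattern can be reproduced on a much smaller input. The sketch below is an illustration only: the two-section `text` and the names `overrun` and `bounded` are made up for this example. It relies on the documented behavior that, without re.MULTILINE, `$` matches only at the end of the string (or just before a trailing newline), so the lazy group has to grow until it gets there. The second pattern is one possible alternative, not the approach used in the answer that follows: it stops at a lookahead for the next heading or the end of the text.

```python
import re

# Tiny stand-in for the org file (hypothetical data for this illustration)
text = (
    "** A [0/1]\n"
    " - [ ] first\n"
    "** B [1/1]\n"
    " - [X] second\n"
)
head = "A"

# Pattern from the question: with only re.DOTALL, `$` cannot match at the end
# of an inner line, so the lazy group runs on into the next section.
overrun = re.compile(r"\*\* " + re.escape(head) + r" (.+?)\*?$", re.DOTALL)
print(repr(overrun.findall(text)[0]))

# Alternative: stop lazily before the next line that starts with "** ",
# or at the very end of the text (\Z) for the last section.
bounded = re.compile(
    r"\*\* " + re.escape(head) + r" (.+?)(?=^\*\* |\Z)",
    re.DOTALL | re.MULTILINE,
)
print(repr(bounded.findall(text)[0]))
```

Note that both groups still begin with the rest of the heading line (here the `[0/1]` counter), because the capture starts right after the heading text.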

Please refer to the following approach:

You can use:

import re 
 
string = """ 
** Hardware [0/1] 
 - [ ] adapt a programmable motor to a tripod to be used for panning  
** Reading - Technology [1/6] 
 - [X] Introduction to Networking - Charles Severance 
 - [ ] A Tour of C++ - Bjarne Stroustrup 
 - [ ] C++ How to Program - Paul Deitel 
 - [X] Computer Systems - Randal Bryant 
 - [ ] The C programming language - Brian Kernighan 
 - [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
 - [ ] Patrick McKeown - The Oxygen Advantage 
 - [X] Total Knee Health - Martin Koban 
 - [X] Supple Leopard - Kelly Starrett 
 - [X] Convict Conditioning 1 and 2   
 """ 
 
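def getitems(section):
    # Match the heading line, then capture everything up to the next line that starts with **
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    # search() returns None when the heading is not found
    return match.group('block') if match else None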
 
items = getitems('Reading - Technology') 
print(items) 

See it working on ideone.com.


The core of the code is the (condensed) expression:

^\*{2}.+[\n\r]       # match the beginning of the line, followed by two stars, anything else in between and a newline 
(?P<block>           # open group "block" 
    (?:              # non-capturing group 
        (?!^\*{2})   # a neg. lookahead, making sure no ** follows at the beginning of a line 
        [\s\S]       # any character... 
    )+               # ...at least once 
)                    # close group "block" 

In the actual code, the search string is inserted right after the **. See a demo for Reading - Technology on regex101.com.
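The annotated layout above can also be kept in the source itself by compiling with re.VERBOSE. The following is a minimal sketch under a few assumptions: the helper name `build_section_regex` and the two-section `sample` are invented for illustration, and the literal space after the stars is written as `[ ]` because verbose mode ignores unescaped whitespace in the pattern. re.escape() escapes spaces and `#`, so the inserted heading text survives verbose mode.

```python
import re

def build_section_regex(section):
    """Compile the section pattern in verbose form (hypothetical helper)."""
    return re.compile(r'''
        ^\*{2}[ ]            # two stars at the start of a line, then a space
        ''' + re.escape(section) + r'''
        .+[\n\r]             # rest of the heading line and its line break
        (?P<block>           # open group "block"
            (?:
                (?!^\*{2})   # stop before a line that starts with two stars
                [\s\S]       # any character, including line breaks
            )+               # at least once
        )                    # close group "block"
    ''', re.MULTILINE | re.VERBOSE)

# Made-up sample data for the illustration
sample = (
    "** Hardware [0/1]\n"
    " - [ ] motor\n"
    "** Reading - Health [3/4]\n"
    " - [X] Book\n"
)
match = build_section_regex('Reading - Health').search(sample)
print(repr(match.group('block')))
```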


As a follow-up, you can also return only the selected (checked) items, like so:

def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    if match is None:
        # heading not found
        return None
    items = match.group('block')
    if selected:
        # keep only the titles of the checked ([X]) items
        rxi = re.compile(r'^ - \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items
 
items = getitems('Reading - Health', selected=True) 
print(items)
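If both checked and unchecked items are of interest, the block returned for a section could be split into (checked, title) pairs instead of being filtered. This is only a sketch: `ITEM_RX` and `parse_items` are hypothetical names, and it assumes every item line has the form ` - [ ] title` or ` - [X] title` as in the sample file. Trailing whitespace after the title is dropped.

```python
import re

# One checkbox line: optional indent, "- [ ]" or "- [X]", then the title
ITEM_RX = re.compile(
    r'^[ \t]*- \[(?P<mark>[ X])\] (?P<title>.+?)[ \t]*$',
    re.MULTILINE,
)

def parse_items(block):
    """Return a list of (is_checked, title) pairs for each checkbox line."""
    return [
        (m.group('mark') == 'X', m.group('title'))
        for m in ITEM_RX.finditer(block)
    ]

# Two lines copied from the "Reading - Health" section
block = (
    " - [ ] Patrick McKeown - The Oxygen Advantage \n"
    " - [X] Total Knee Health - Martin Koban \n"
)
print(parse_items(block))
```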