Any regular expression gurus here?

ozoned


I'm writing a Python script to produce a document outline from the headers in a reStructuredText document.

EDIT: Never mind, I found one way:

Code:
^([=|`|\-|]+\n[\w| |]+\n[=|`|\-]+|[\w| ]+\n[=|`|\-]+)
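
A side note on the character classes in that pattern, in case anyone copies it: inside square brackets a | is not alternation, it is a literal pipe, so the class [=|`|\-] also accepts | as a header character. A quick check (the two pattern names are mine, purely for illustration):

```python
import re

# Character class as written in the pattern above: the pipes are literal members.
with_pipes = re.compile(r"[=|`|\-]+")
# Same class with the pipes dropped: only =, ` and - are accepted.
without_pipes = re.compile(r"[=`\-]+")

print(bool(with_pipes.fullmatch("==|==")))     # pipe is accepted
print(bool(without_pipes.fullmatch("==|==")))  # pipe is rejected
print(bool(without_pipes.fullmatch("=====")))  # plain underline still matches
```

That is harmless for most documents, but it does mean a line of pipes would count as an underline.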

The regex I've come up with is:
Code:
header_pat = r"""^([\n|[=|\-|~|`|\+]+]?[\w| |]+\n[=|\-|~|`|\+]+)"""
header_re = re.compile(header_pat, re.M)

That gives me all the headers OK, but I get an extraneous blank line at the beginning of each match.
What I'm looking for is a regex that does not match the preceding blank line. The complication is that RST
allows optional overlines.
The regex above is just a prototype, so it does not yet match all possible header characters.

There may be a beer involved, no promises :-)

The following shows the match I am getting, and the match I want:
Code:
<~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=======================================
 reStructuredText Markup Specification
=======================================

Testing header
==============

-----------------------
 Quick Syntax Overview
-----------------------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~>>
=======================================
 reStructuredText Markup Specification
=======================================
Testing header
==============
-----------------------
 Quick Syntax Overview
-----------------------

The sample text follows:
Code:
.. -*- coding: utf-8 -*-

=======================================
 reStructuredText Markup Specification
=======================================

reStructuredText_ is plaintext that uses simple and intuitive
constructs to indicate the structure of a document.

Testing header
==============

Simple, implicit markup is used to indicate special constructs, such
as section headings, bullet lists, and emphasis.  The markup used is

reStructuredText is applicable to documents of any length, from the
very small (such as inline program documentation fragments, e.g.

-----------------------
 Quick Syntax Overview
-----------------------

A reStructuredText document is made up of body or block-level
elements, and may be structured into sections.  Sections_ are

Here are examples of `body elements`_:

----------------
 Syntax Details
----------------

Descriptions below list "doctree elements" (document tree element
names; XML DTD generic identifiers) corresponding to syntax

Whitespace
==========

Spaces are recommended for indentation_, but tabs may also be used.
Tabs will be converted to spaces.  Tab stops are at every 8th column.

Other whitespace characters (form feeds [chr(12)] and vertical tabs
[chr(11)]) are converted to single spaces before processing.

Blank Lines
-----------

Blank lines are used to separate paragraphs and other elements.
Multiple successive blank lines are equivalent to a single blank line,

RCS Keywords
````````````

`Bibliographic fields`_ recognized by the parser are normally checked
for RCS [#]_ keywords and cleaned up [#]_.  RCS keywords may be

------
Other
------

Indentation
-----------

Indentation is used to indicate -- and is only significant in
indicating -- block quotes, definitions (in definition list items),
 

This may come as a total shock: I am using docutils. I'm looking to produce nice little document outlines from my thousands of reStructuredText docs.
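
For the outline step itself, here is a rough sketch using plain `re` rather than docutils. It assumes the prototype's adornment characters only, and it relies on the RST rule that section levels follow the order in which title styles are first encountered, with overline-plus-underline counting as a different style from underline-only. The names `header_re`, `outline` and the group names are made up for this example.

```python
import re

# Adornment characters from the prototype; real RST allows more punctuation.
ADORN = r"[=\-~`+]+"

# Optional overline, then the title line, then the underline.
header_re = re.compile(
    rf"^(?:(?P<over>{ADORN})\n)?(?P<title>[\w ]+)\n(?P<under>{ADORN})$",
    re.M,
)


def outline(text):
    """Return indented title lines, one per header found in text."""
    styles = []  # title styles in order of first appearance
    lines = []
    for m in header_re.finditer(text):
        # A style is the underline character plus whether an overline exists.
        style = (m.group("under")[0], m.group("over") is not None)
        if style not in styles:
            styles.append(style)
        level = styles.index(style)
        lines.append("  " * level + m.group("title").strip())
    return lines


sample = """\
=======
 Title
=======

Intro text.

First
=====

Body text here.

Deeper
------

Second
======
"""

print("\n".join(outline(sample)))
```

This sketch does not check that the overline and underline use the same character or are long enough, so it is looser than a real RST parser.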
 
Would changing the pattern to the following help?

Code:
header_pat = r"""^[\n]?([\n|[=|\-|~|`|\+]+]?[\w| |]+\n[=|\-|~|`|\+]+)"""
 
Hi icyrus:
That gets me halfway there. The overline/underline-style headers match OK, but the underline-only style still matches an extraneous leading newline.
(Only half a beer for you! :-)

I managed to get what I am after with a horrible alternation-style regex (see my edit at the top of the post).
I was hoping someone might have some fancy lookaround-fu ;)

Thanks for the suggestion though.
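
For what it's worth, lookaround may not be needed here. As far as I can tell, the leading newline comes from the prototype's outer brackets being parsed as one big character class that happens to contain `\n`, so the "optional overline" part is free to swallow the blank line. A minimal sketch that makes the overline an optional non-capturing group instead, still limited to the prototype's adornment characters (`ADORN` and `sample` are names I made up):

```python
import re

# Adornment characters from the prototype; real RST allows more punctuation.
ADORN = r"[=\-~`+]+"

# The optional overline sits in a non-capturing group right after ^,
# so a match can only start on an adornment line or on the title line.
header_re = re.compile(rf"^(?:{ADORN}\n)?[\w ]+\n{ADORN}$", re.M)

sample = """\
.. -*- coding: utf-8 -*-

=======
 Title
=======

Some paragraph text.

Testing header
==============
"""

for match in header_re.finditer(sample):
    print(repr(match.group(0)))
```

Neither character class contains a newline, so a blank line can never be part of a match. The pattern does not verify that the overline and underline use the same character or the same length.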
 