Issue Details (XML | Word | Printable)

Key: REPOR-157
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: David Peterson
Reporter: Stephen Morad
Votes: 1
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
Confluence Extension: Reporting Plugin

Text filter of content:body not returning results

Created: 11/Jul/08 09:23 PM   Updated: 02/Sep/08 08:24 AM
Component/s: None
Affects Version/s: 2.1.1
Fix Version/s: 3.0.0

Time Tracking:
Not Specified

File Attachments: 1. Java Source File RegexParser.java (37 kB)
2. Zip Archive SystemOut.zip (4.28 MB)

Environment: Linux, Tomcat, Java 1.5.0_12, Confluence 2.8.0


 Description  « Hide
The following code works perfectly if content:body is changed to content:title. But using content:body the report macro always returns "No content body matches this request."
{report-table}

{content-reporter:spaces=myspace|type=page|scope=myspace:Status > descendents}
{text-sort:content:title}
{text-filter:content:body|include=.*Approved.*}
{content-reporter}

{report-column:title=Request Page}{report-info:content:title|link=true}{report-column}
{report-column:title=Status}(/){report-column}

{report-empty}_No content body matches this request._{report-empty}

{report-table}


 All   Comments   Work Log   Change History      Sort Order: Ascending order - Click to sort in descending order
David Peterson added a comment - 12/Jul/08 12:59 AM
Weird. I can't think of any reason why that would be the case...

I've added it to be examined for the next version.


David Peterson added a comment - 12/Jul/08 01:00 AM
Could you please provide an example of the actual wiki markup on the page body you're matching to?

Steve Colyer added a comment - 30/Jul/08 12:54 PM
Hello,

We too have this issue. In trying to display subpages of a current page, we attempted to use the text-filter within a report-list. In the first section of sample code, you can see it setting up the text-filter based on content:title looking for a string with an "a" in it, and to report on that. And it does so nicely, basically printing out the content:title and content:body of each subpage. However the second section with the text-filter based on content:body does not print anything except the error marker "There are no sample Learning More... modules." It too searches for an "a" and one can clearly see it in the text printed as a result of the first section. Why is the macro not finding this in the content:body, but can find it in content:title? Additionally, there are only about 10 subpages and when I insert this code response time for the page goes from about
1-2 seconds to 10-15 seconds. Why is that?

What follows is code from our Confluence page demonstrating the error:

{report-list}
{content-reporter:scope=CSC Collaboration Workshop Modules>children}
{text-filter:content:title|include=.*a.*} {content-reporter}
{report-body}{report-link:content:title}{report-info:content:title}---{report-info:content:body}(report-link} {report-body}
{report-empty}There are no sample Learning More... modules.{report-empty} {report-list}

Separator

{report-list}
{content-reporter:scope=CSC Collaboration Workshop Modules>children}
{text-filter:content:body|include=.*a.*} {content-reporter}
{report-body}{report-link:content:title}{report-info:content:title}---{report-info:content:body}{report-link} {report-body}
{report-empty}There are no sample Learning More... modules.{report-empty} {report-list}

Sample output from macro code:

High Definition Video Conferencing Exploration with LifeSize Communications---h2. CSC Collaboration Module - High Definition Video Conferencing Exploration with LifeSize Communications h4. Time Required: 2 hours h4. Module Category: GettingTheOrganizationToUseCollaboration h4. Delivery Style: Demo h4. Summary The purpose of this

... about 200 lines of text ...

Updates to Materials h4. Facilitator Notes * Thought Provoking Questions * Concepts to Highlight * Cautions and Risk Management * Expert Resources(report-link}
Separator

There are no sample Learning More... modules.

What else do you need to solve the problem? Can you think of an alternative approach that would let me select subpages by text contained in content:body? Let me know if there are any questions.

Steve


David Peterson added a comment - 30/Jul/08 01:50 PM
Thinking about about it further, I think I know what the problem is. By default, Pattern will only match single non-new-lines to '.'. To get it doing what you expect, you need to prefix your pattern with "(?s)" (single-line mode), which will make newlines match '.' patterns. Eg:
{text-filter:content:body|include=(?s).*Foo.*}

I'll consider whether this should be the default mode for these macros. I think in general most users will expect it to be, and it can be manually turn it off again by using "(?-s)" if necessary.


Steve Colyer added a comment - 30/Jul/08 10:09 PM
Hello David,

Well, Thank you, Thank you.

Your suggestion to insert the "(?s)" worked like a champ for me. It seems like content:title must not have a newline, while I'm certain that content:body does. If I'm permitted a vote, I'd like to see it become the default. At the very least, I think I'd like to see the issue mentioned in some of the documentation.

Thanx again, Steve


Steve Colyer added a comment - 30/Jul/08 10:32 PM
Hello again David,

OK, OK, I'm back again.

What I've been working on doing with this report and pattern match is to go through all the subpages of a page and look for each of 5 values within the content:body. For each subpage that matches, I report a link so it can be referenced. I go through the set of subpages 5 times, once for each of those special values. The end result is I've categorized the subpages into related groups. The are approximately 10 subpages at this point, each of which contains about 60-200 lines of text. Now the problem. The response time for this is approximately 75 seconds, or about 15 seconds for each search. This seems a bit excessive. Any ideas?

Steve


David Peterson added a comment - 31/Jul/08 03:32 AM
Yeah, we will at least document it. I'm leaning towards making it the default for the next version.

With regards to the performance, I'm not sure what I can suggest. The regular expression parser is about as efficient as it can get without switching to a non-standard parsing engine (and I'm not sure that any of the others are any faster). Text parsing is a pretty slow process when it comes down to it, unfortunately.

Something which could be done for these simple cases in the future is to add a 'contains XXX' key to the Text Supplier. This would let you do this:

{boolean-filter:content:body > contains Foo}

...instead of the text-filter. This would use the String.indexOf method rather than building a regular expression. I suspect this would be more efficient, but I have to say, probably not by much...

In your case, with the amount of text present (200 lines is a reasonable amount), a custom solution might be in order. With something like this, the body only changes when you edit it, so it would be ideal if you could do some processing on the page at the point when it is saved, rather than every time you view the report. A custom plugin which listened for page edit/create events and automatically generates labels on a page for certain key words, for example. Then you could use labels to group your pages, rather than parsing text every time...


Steve Colyer added a comment - 01/Aug/08 11:30 PM
Hello David,

What you said about regex parsing concerned me. I recall it not being great, but I had a hard time understanding how it could be a major contributor to 10-20 second response time. So I ended up writing a Java test. It turns out there were only 8 subpages of this page in question having the long response time. So what I decided to try was writing a bit of code that used the Patterns class you mentioned to parse through the content:body of each of those 8 subpages. Not being willing to learn all the APIs for this trial, I instead went into Confluence to each of the 8 pages in edit mode and extracted the complete text of the body of the page, then embedded that in the Java code. Now by far and away, I realize this is not everything that the macro does, but you seemed to think that regex may have played a significant part. Well, when I compiled and ran it on my 3 year old laptop, the elapsed time is about 16 ms. A few orders of magnitude away from the response times. Are you sure this isn't something other than regex parsing? What am I missing here?

Steve

P.S. I've included the code here as well, if you're interested.


David Peterson added a comment - 02/Aug/08 02:29 AM
Honestly, performance stuff is purely speculative currently. Reporting in general can be slow, so it's not purely regular expressions or any one thing. But they do all contribute - particularly over larger data sets than what you're working with.

A question: If you just run a single copy of the report without doing any filtering vs. with a single text-filter, what is the performance difference? Does the report still take a similar amount of time? If it does, it is obviously some other part of the report. If not, there is something bad happening in the generation or execution of filtered reports.

If you want to post your full report code I may be able to spot common performance-sucking techniques...

In any case, general performance is one thing we will be focusing on for upcoming releases of Reporting, so watch this space.


Steve Colyer added a comment - 04/Aug/08 01:58 PM
Hello David,

OK, Here goes. When I take all reporting constructs out of the code, my response time is about 5 secs. If I re-insert a single instance of my reporting code with the text filter, I get response times of 15-16 secs. If I remove the single text-filter line, my response times run about 14-15 secs. If I put back in all 5 instances of my reporting code with text-filters, the response time is about 25-30 secs.

The code in my current reporting construct is:

{report-block:separator=newline}
{content-reporter:scope=CSC Collaboration Workshop Modules>children}
{text-filter:content:body|include=(?s).*IntroductoryTopics.*}{content-reporter}
{report-body}{report-link:content:title}{report-info:content:title}{report-link}{report-body}
{report-empty}There are no sample Introductory Topics modules.{report-empty}{report-block}

Very similar to my earlier entries, but with your filtering suggestion incorporated.

Thoughts?

Steve


Steve Colyer added a comment - 04/Aug/08 02:24 PM
Hello again David,

Reformatted the code to get explicit line structure so that when I said "remove the single text-filter line" it would make sense.

{report-block:separator=newline} 
{content-reporter:scope=CSC Collaboration Workshop Modules>children} 
{text-filter:content:body|include=(?s).*LearningMoreAboutCollaboration.*} 
{content-reporter} 
{report-body}{report-link:content:title}{report-info:content:title}{report-link} 
{report-body} 
{report-empty}There are no sample Learning More about Collaboration modules.{report-empty} 
{report-block}

Also, we've done performance profiling of this issue, and can upload it if you believe it would help to see it.

Steve


David Peterson added a comment - 04/Aug/08 04:19 PM
Thanks for doing that Steve. This indicates that most of the time is being spent either in the report generation or the report output. I would bet that most of it is in the former, with the pretty simple output you have, but I've been wrong before.

The profiling data would be useful, assuming I can interpret it.


Steve Colyer added a comment - 04/Aug/08 10:26 PM
Hello David,

Attached you'll find a zip of SystemOut.log. This is a log from the IBM Websphere server supporting Confluence on a RHEL box. This is one server in a clustered configuration of two. This is a pretty long log file. The entry you're looking for starts a 17:35:38.697. The profile trace at this point is for a page in the collab space that invokes an instance of the earlier debug version of the macro invocation. Remember this profile entry appears after the GET has completed. The total duration for this is about 10 seconds, so you may have to go back a bit to look for other pertinent entries in the trace. From the bit I looked at it, it seemed like it was spending more time in the first section as well. Please take a look and see what you can find and let me know if you need anything else as well.

Good luck.

Steve


David Peterson added a comment - 05/Aug/08 02:40 AM
Hi Steve,

Having dug through the trace (once I managed to find it - you're right - the file is huge , it looks like there are two main time sinks:

1) The query itself (almost 1000ms)
2) Permissions checking (usually around 25ms, occasionally up to 150ms, but there are many of them)

I noticed elsewhere in the logs that you're using LDAP. I'll assume that you've got caching turned on, or I suspect it would be even slower, but you never know. Some versions of Confluence shipped with the LDAP configuration not being cached. That would mean every permission check has to bounce to the LDAP server, which would definitely slow things down. Search Atlassian's site for 'LDAP caching' to check your configuration is correct, or open a request with Atlassian Support.

That said, permission checking in general will always have a performance hit, just because it has to check global and space permissions for both individual users and user groups.

Anyway, thanks for the report, it's given us a couple of things to look at to see if we can improve them. We'll keep you posted.


Steve Colyer added a comment - 05/Aug/08 01:25 PM
Hello David,

Thanx for looking into this further.

In thinking about what you said for permission checking, I could possibly understand this if the number of sub pages were huge, but they're not. Direct children there's something like 8 and if we include all descendants, it's less than 20. So if we take an average permission check response time of say 35ms, it seems that you'd have a total of 280-700 ms. However, what we're seeing here is a total of something like 7 seconds devoted to this permission checking process, meaning, say 200 permissions checks. Any ideas as to why?

Also, I don't think it's LDAP, because we happened to do an LDAP trace during this session for some other Atlassian problems we were having. The test ran for about 40-50 minutes, with the testing executing about 15-18 transactions in particular. All the while normal evening production was taking place on the same server as well. In total during that time period, there were < 700 LDAP Tx, including bind, unbind, and searches. I also went back to the trace just now and in the period 17:35:00 - 17:36:00, there were 17 LDAP search Tx. So, I don't think an it's an LDAP issue.

So I'm still not understanding why so many permissions checks, and that if it's not going to LDAP, why it's taking so long to do them.

Any ideas?

Steve


David Peterson added a comment - 05/Aug/08 03:34 PM
With regards to some of the extra checks, having a quick look at the code it's possible that the ContentReporterMacro is doing double-checks on content permissions. I will need to do proper testing to see what happens when I remove the checks (the original checks may or may not be happening inside Atlassian's code - need to confirm). Still, that's basically only one extra call to PermissionManager.hasPermission() (which, in turn checks groups/users/etc.) per page. I'm not sure where the extra 160 checks are coming from at this point.

In any case, we'll keep looking into it. If you like, it might be worth creating a separate issue for this, specific to the performance issues, rather than the original problem for this issue (regular expressions).


David Peterson added a comment - 02/Sep/08 08:24 AM
The include/exclude parameters will now search across multiple lines by default.