Question

Long Pham

Rock.Jobs.IndexRockSite indexing wrong pages

Hi,

I'm implementing Universal Search and have followed the directions on this page: https://www.rockrms.com/Rock/BookContent/32/120

I set up the Rock.Jobs.IndexRockSite job to run once a day at 11 PM and to index the main external site (we'll call it Site A).  Site A contains pages from one set of pages (we'll call it Page Set A).  The job reported that it indexed 230 pages.

After the job ran, I did a search and noticed that pages from another page set (we'll call it Page Set B) showed up in the results.  So I went through the pages in Page Set B and found that those pages were also connected to Site A.  So I created a new site (we'll call it Site B) and connected all the pages from Page Set B to Site B.

I re-ran the Rock.Jobs.IndexRockSite job and it now indexed 180 pages.  However, when I do a search in Site A, it still displays pages from Page Set B of Site B.

Other than deleting the entire Page Set B, I'm not sure what other options I have.  Any ideas?

  • Long Pham

    Ah, that makes sense.  Since everything is stored in the database, can I run a query to search the page contents of Page Set A for the string "/page/247" to find where the offending link is?

    • Daniel Hazelbaker

      There are some SQL commands you can construct and run to try to find things - but we spent hours trying to track all that down and finally gave up. So we just let the cross-site indexing happen. It's not ideal but it means we can at least have search working. Someday we'll look at modifying Rock to somehow allow the site crawler to know what Site Id it's on so it can ignore pages on the wrong site.
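
The kind of search discussed above can be sketched in Python. This is only an illustration: the `PAGE_SITE` and `PAGE_HTML` dictionaries are hypothetical stand-ins for data you would have to export yourself, and they do not reflect Rock's actual database schema. The sketch also assumes links are written as `/page/<id>`; links written in any other form would not be caught.

```python
import re

# Hypothetical data: page id -> owning site. Not Rock's real schema;
# this structure exists only for illustration.
PAGE_SITE = {115: "A", 120: "A", 247: "B"}

# Hypothetical page contents keyed by page id.
PAGE_HTML = {
    115: '<a href="/page/120">Groups</a> <a href="/page/247">Events</a>',
    120: '<a href="/page/115">Home</a>',
}

# Matches relative links such as href="/page/247".
PAGE_LINK = re.compile(r'href=["\']/page/(\d+)["\']', re.IGNORECASE)


def find_cross_site_links(page_html, page_site, site):
    """Return (source_page, target_page) pairs where a page on `site`
    links to a page that belongs to a different site."""
    offenders = []
    for source_id, html in sorted(page_html.items()):
        if page_site.get(source_id) != site:
            continue  # only scan pages that belong to the site being indexed
        for match in PAGE_LINK.finditer(html):
            target_id = int(match.group(1))
            target_site = page_site.get(target_id)
            if target_site is not None and target_site != site:
                offenders.append((source_id, target_id))
    return offenders


print(find_cross_site_links(PAGE_HTML, PAGE_SITE, "A"))  # [(115, 247)]
```

Each pair in the result names the page holding the stray link and the page on the other site it points to, so the link can be fixed at its source.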

  • Daniel Hazelbaker

    That is something we haven't found a solution for just yet.  Here is what happens:

    Let's say you have two sites: site-a.com and site-b.com; both are hosted in Rock.

    You have a bunch of pages on site A, let's call them page numbers 100 - 199.

    You have a bunch of pages on site B, let's call them page numbers 200 - 299.

    Let's also assume that on page 115, you have a link to page 247, and it's just a relative link of /page/247 because somebody wasn't paying attention, and nobody noticed because the link worked fine. Normally that page is viewed as site-b.com/page/247, but in this case it's being linked as site-a.com/page/247, which works perfectly fine since Rock gets the Site from the page number (so it still displays the right theme and such).

    So, when you configure site A to be indexed, it crawls site A, eventually hits page 115 on Site A and gets linked over to /page/247. The crawler still thinks it's on site A because the domain is still correct. But it's now crawling pages on site B.

    Hope this helps explain, but again I don't know of a way to fix it.
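
The behaviour described above can be reproduced with a small simulation. This is a toy model, not how Rock's crawler is implemented: the `PAGES` map, the `crawl` function, and the `restrict_to_site` parameter are all hypothetical. It contrasts a crawl that follows every link on the same domain with one that knows which site each page belongs to.

```python
from collections import deque

# Hypothetical site map: page id -> (owning site, linked page ids).
PAGES = {
    100: ("A", [115]),
    115: ("A", [100, 247]),  # the stray relative link to /page/247
    247: ("B", [248]),
    248: ("B", [247]),
}


def crawl(start, pages, restrict_to_site=None):
    """Breadth-first crawl from `start`, returning visited page ids in order.

    With restrict_to_site=None every link is followed, mirroring a crawler
    that only checks the domain. With a site given, pages owned by any
    other site are skipped and their links are not followed.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        page_id = queue.popleft()
        site, links = pages[page_id]
        if restrict_to_site is not None and site != restrict_to_site:
            continue  # page belongs to another site: do not index it
        visited.append(page_id)
        for target in links:
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


print(crawl(100, PAGES))                        # [100, 115, 247, 248]
print(crawl(100, PAGES, restrict_to_site="A"))  # [100, 115]
```

A single stray link on page 115 is enough to pull every reachable page of site B into the first crawl, which matches what shows up in the search results. The second call shows the effect a site-aware check would have, assuming the crawler could look up each page's site.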

  • David Stevens

    To post a possible solution here: I had to disable Lucene and remove the index results entirely (~/AppData/Lucene) to force a reindex on the site where the domain changed.

  • Long Pham

    David, I think we may need to do this as well because a new problem came up.  We ended up deleting some of the pages from Page Set B, but they still appear in the search results.  So when you click on one of the links to these pages in the search results, you get a 404.  When you say you disabled Lucene and removed the results, do you mean you just deleted the contents of the ~/AppData/Lucene folder?