Merged
111 changes: 111 additions & 0 deletions .docsearch.README.md
@@ -0,0 +1,111 @@
# DocSearch Configuration

This directory contains the Algolia DocSearch configuration for the Apache Camel website.

## Overview

The `.docsearch.config.json` file defines how Algolia's crawler indexes the Camel website for search functionality. This configuration ensures that all relevant content is discoverable through the site search, including:

- All component documentation (not just canonical versions)
- Tables with component specifications and supported models
- Metadata sections and inline code
- Multiple documentation versions (next, latest, and release branches)

## Key Configuration Elements

### Index Settings (`index`)
- **name**: `apache_camel` - The Algolia index where content is stored
- **startUrls**: Entry points for the crawler
- **pathsToMatch**: URL patterns to include in indexing
- **pathsToIgnore**: URLs to skip (search pages, error pages, etc.)
- **includeHeadingLevels**: All heading levels (h1-h6) are indexed for better navigation
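
The interaction between the match and ignore lists can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual logic: `should_crawl` is a hypothetical helper, and it assumes the glob patterns behave like `fnmatch` patterns (where `*` also matches `/`). The domain and path values are copied from `.docsearch.config.json`.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Values copied from .docsearch.config.json
ALLOWED_DOMAINS = {"camel.apache.org"}
PATHS_TO_MATCH = ["https://camel.apache.org/**"]
PATHS_TO_IGNORE = [
    "https://camel.apache.org/search",
    "https://camel.apache.org/404.html",
]


def should_crawl(url):
    """Hypothetical filter: ignore rules win over match rules."""
    if urlsplit(url).hostname not in ALLOWED_DOMAINS:
        return False
    if any(fnmatchcase(url, pattern) for pattern in PATHS_TO_IGNORE):
        return False
    # Assumption: fnmatch-style globbing approximates the crawler's patterns.
    return any(fnmatchcase(url, pattern) for pattern in PATHS_TO_MATCH)


component_page = should_crawl("https://camel.apache.org/components/next/index.html")
search_page = should_crawl("https://camel.apache.org/search")
other_site = should_crawl("https://example.org/")
```

Under these assumptions the component page is accepted, while the search page and the foreign domain are rejected.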

### Content Selectors (`selectors`)

These CSS selectors define what content gets indexed:

- **lvl0-lvl5**: Heading hierarchy (h1-h6) used to build the breadcrumb structure
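- **text**: Main content to index, including:
  - Paragraphs (`p`) and list items (`li`)
  - Table cells (`td`, `th`) - **Important for component specs**
  - Definition terms and descriptions (`dt`, `dd`)
  - Inline code and code blocks (`code`, `pre`)
  - Generic containers (`span:not(.tooltip)`, `div:not([class*='hidden'])`, `table tbody`)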

This ensures keywords like "PyTorch" in Model Zoo tables are indexed, fixing issue #1209.
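
To see why adding `td` and `th` matters, here is a minimal sketch using Python's standard `html.parser`. It is not how the Algolia crawler is implemented; `TextCollector` and `extract_text` are hypothetical names, and only the plain tag selectors from the list above are modelled.

```python
from html.parser import HTMLParser

# Plain tag names from the `text` selector (attribute/pseudo selectors omitted).
TEXT_TAGS = {"p", "li", "td", "th", "dt", "dd", "code", "pre"}


class TextCollector(HTMLParser):
    """Collects text that appears inside any of the given tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = tags
        self.depth = 0  # number of matching tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, tags=TEXT_TAGS):
    parser = TextCollector(tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<table><tr><th>Engine</th><td>PyTorch</td></tr></table><div>layout</div>"
with_cells = extract_text(sample)
without_cells = extract_text(sample, TEXT_TAGS - {"td", "th"})
```

With the table cell tags included, `with_cells` contains "PyTorch"; without them, nothing from the sample table is collected.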

### Exclusions (`selectors_exclude`)

Navigation, sidebars, footers, and other non-content elements are excluded to improve search quality:
- `.no_index`, `[data-no-index]` - Custom exclusion attributes
- Navigation elements: `nav`, `.navbar`, `.menu`, `.sidebar`, `.toc`
- Footer and copyright
- Hidden elements: `.hidden`, `[aria-hidden='true']`
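
The effect of these exclusions can be approximated with a small standard-library sketch. It is an illustration under simplifying assumptions (class names, two attributes, and a few tag names only), not the crawler's selector engine; `ExcludingCollector` is a hypothetical name.

```python
from html.parser import HTMLParser

# Subset of the exclusions listed above, split by how they are matched.
EXCLUDED_TAGS = {"nav", "footer", "script", "style"}
EXCLUDED_CLASSES = {"no_index", "navbar", "menu", "sidebar", "toc", "hidden"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never have end tags


class ExcludingCollector(HTMLParser):
    """Collects text, skipping everything nested inside an excluded element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an excluded subtree
        self.chunks = []

    def _is_excluded(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        return (
            tag in EXCLUDED_TAGS
            or bool(classes & EXCLUDED_CLASSES)
            or "data-no-index" in attrs
            or attrs.get("aria-hidden") == "true"
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0 or self._is_excluded(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


page = (
    '<nav><a href="/">Home</a></nav>'
    "<p>Firmata component</p>"
    '<div class="toc"><ul><li>Intro</li></ul></div>'
    '<span aria-hidden="true">#</span>'
    "<footer>Copyright</footer>"
)
collector = ExcludingCollector()
collector.feed(page)
collector.close()
indexed = collector.chunks
```

Only the paragraph text survives; the navigation link, table of contents, hidden anchor marker, and footer are dropped.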

### Crawling Rules (`crawler`)

- **maxDepth**: 20 - Allows deep navigation through component docs
- **maxUrls**: 50,000 - Sufficient for Camel's comprehensive documentation
- **sitemapUrls**: Uses sitemap for efficient crawling
- **timeoutMs**: 30,000 - Adequate for large pages with tables

### Multi-Version Support (`start_urls`)

The configuration crawls multiple documentation versions:

1. **next** (page_rank: 5) - Development version
2. **latest** (page_rank: 5) - Latest stable
3. **`\d+\.\d+\.x`** (page_rank: 4) - Release branches (4.4.x, 4.10.x, etc.)
4. **manual** (page_rank: 7) - Core documentation
5. **docs** (page_rank: 6) - General documentation
6. **blog** (page_rank: 3) - Blog posts
7. **site root** (page_rank: 8) - `https://camel.apache.org/` (highest priority)

This addresses the issue where only canonical (4.4.x) pages were indexed.
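
As a rough illustration of how a page URL maps to one of these entries, the sketch below treats each `start_urls` value as a regular expression and returns the rank of the first match. Both the first-match rule and the escaped dots in the host name are assumptions made for the example; `page_rank_for` is a hypothetical helper, not part of DocSearch.

```python
import re

# (pattern, page_rank) pairs mirroring start_urls in .docsearch.config.json.
# Assumption: entries are tried in order and the first match wins.
START_URLS = [
    (r"https://camel\.apache\.org/components/next/", 5),
    (r"https://camel\.apache\.org/components/latest/", 5),
    (r"https://camel\.apache\.org/components/(\d+)\.(\d+)\.x/", 4),
    (r"https://camel\.apache\.org/manual/", 7),
    (r"https://camel\.apache\.org/docs/", 6),
    (r"https://camel\.apache\.org/blog/", 3),
    (r"https://camel\.apache\.org/", 8),
]


def page_rank_for(url):
    """Returns the page_rank of the first matching entry, or None."""
    for pattern, rank in START_URLS:
        if re.match(pattern, url):
            return rank
    return None


release_rank = page_rank_for(
    "https://camel.apache.org/components/4.10.x/languages/simple-language.html"
)
next_rank = page_rank_for("https://camel.apache.org/components/next/index.html")
fallback_rank = page_rank_for("https://camel.apache.org/community/")
```

Note that the release-branch pattern ends in a literal `x`, so `4.10.x` matches while a path such as `4.10.1` would fall through to the site root entry.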

### Search Behavior (`custom_settings`)

- **searchableAttributes**: Fields available for full-text search
- **separatorsToIndex**: Include underscores, dots, and dashes in search (important for component names like `camel-k`)
- **attributeForDistinctResults**: Deduplicate results by URL to avoid showing the same page multiple times

## Maintenance

When making changes to this configuration:

1. **Test locally** - Build the site and verify crawling works
2. **Document changes** - Explain why selectors or URLs were modified
3. **Consider impacts** - Changes affect search indexing across all users
4. **Verify coverage** - Use Algolia dashboard to check what's indexed
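
Before running through the checklist above, a quick structural check can catch obvious mistakes such as broken JSON or a `start_urls` entry without a rank. The following is a minimal sketch; `check_config` and the rules it enforces are assumptions chosen for illustration, not a schema published by Algolia.

```python
import json


def check_config(text):
    """Returns a list of problems found in a DocSearch-style config string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]

    problems = []
    selectors = config.get("selectors")
    if not isinstance(selectors, dict) or "text" not in selectors:
        problems.append("selectors.text is missing")

    for i, entry in enumerate(config.get("start_urls", [])):
        if not isinstance(entry, dict) or "url" not in entry:
            problems.append(f"start_urls[{i}] has no url")
        elif not isinstance(entry.get("page_rank"), int):
            problems.append(f"start_urls[{i}] has no integer page_rank")
    return problems


good = (
    '{"selectors": {"text": "p, td"}, '
    '"start_urls": [{"url": "https://camel.apache.org/manual/", "page_rank": 7}]}'
)
bad = '{"selectors": {}, "start_urls": [{"url": "https://camel.apache.org/blog/"}]}'
good_problems = check_config(good)
bad_problems = check_config(bad)
```

To run it against the real file, read `.docsearch.config.json` as UTF-8 text and pass the contents to `check_config`; an empty list means none of these particular checks failed.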

### Common Modifications

**Adding new documentation sections:**
```json
{
  "url": "https://camel.apache.org/new-section/",
  "page_rank": 5
}
```

**Excluding problematic content:**
```json
"selectors_exclude": [
  ".no_index",
  ".problematic-element"
]
```

**Adjusting content extraction:**
Modify the `text` selector in the `selectors` section to include additional elements.

## Related Issue

- **Issue #1209**: "The search is not finding several fields"
  - Problem: Keywords like Bradley, firmata, PyTorch not indexed from component documentation
  - Root cause: Missing configuration for table content and non-canonical versions
  - Solution: This configuration file with improved selectors and multi-version crawling

## References

- [Algolia DocSearch Documentation](https://docsearch.algolia.com/)
- [Camel Website GitHub](https://github.com/apache/camel-website)
- [Issue #1209](https://github.com/apache/camel-website/issues/1209)
125 changes: 125 additions & 0 deletions .docsearch.config.json
@@ -0,0 +1,125 @@
{
  "index": {
    "name": "apache_camel",
    "startUrls": [
      "https://camel.apache.org/"
    ],
    "ignoreCanonicalTo": false,
    "pathsToMatch": [
      "https://camel.apache.org/**"
    ],
    "pathsToIgnore": [
      "https://camel.apache.org/search",
      "https://camel.apache.org/404.html"
    ],
    "includeHeadingLevels": [1, 2, 3, 4, 5, 6],
    "stripQueryParameters": true
  },
  "crawler": {
    "userAgent": "Algolia Crawler",
    "maxDepth": 20,
    "maxUrls": 50000,
    "waitUntilFired": true,
    "timeoutMs": 30000,
    "sitemapUrls": [
      "https://camel.apache.org/sitemap.xml"
    ],
    "ignoreRobotsTxt": false,
    "allowedDomains": [
      "camel.apache.org"
    ]
  },
  "selectors": {
    "lvl0": {
      "selector": "h1",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "h4",
    "lvl4": "h5",
    "lvl5": "h6",
    "text": "p, li, td, th, dt, dd, span:not(.tooltip), div:not([class*='hidden']), table tbody, code, pre"
  },
  "selectors_exclude": [
    ".no_index",
    "[data-no-index]",
    ".sidebar",
    ".breadcrumb",
    "nav",
    ".navbar",
    ".menu",
    ".toc",
    "footer",
    ".footer",
    ".copyright",
    ".hide",
    ".hidden",
    "[aria-hidden='true']",
    "script",
    "style",
    ".language-toggle",
    ".sidebar-toggle"
  ],
  "min_indexed_level": 1,
  "only_content_level": false,
  "start_urls": [
    {
      "url": "https://camel.apache.org/components/next/",
      "page_rank": 5
    },
    {
      "url": "https://camel.apache.org/components/latest/",
      "page_rank": 5
    },
    {
      "url": "https://camel.apache.org/components/(\\d+)\\.(\\d+)\\.x/",
      "page_rank": 4
    },
    {
      "url": "https://camel.apache.org/manual/",
      "page_rank": 7
    },
    {
      "url": "https://camel.apache.org/docs/",
      "page_rank": 6
    },
    {
      "url": "https://camel.apache.org/blog/",
      "page_rank": 3
    },
    {
      "url": "https://camel.apache.org/",
      "page_rank": 8
    }
  ],
  "stop_urls": [
    "\\?",
    "#"
  ],
  "custom_settings": {
    "separatorsToIndex": "_.-",
    "attributesForFaceting": [
      "version"
    ],
    "attributesToIndex": [
      "hierarchy",
      "content",
      "url"
    ],
    "minWordSizefor1Typo": 4,
    "minWordSizefor2Typos": 8,
    "exactOnSingleWordQuery": "none",
    "attributeForDistinctResults": "url",
    "searchableAttributes": [
      "hierarchy.lvl0",
      "hierarchy.lvl1",
      "hierarchy.lvl2",
      "hierarchy.lvl3",
      "hierarchy.lvl4",
      "hierarchy.lvl5",
      "content"
    ]
  }
}
29 changes: 29 additions & 0 deletions README.md
@@ -453,6 +453,35 @@ all generated sources in the project first.

Of course this then takes some more time than an optimized rebuild (time to grab another coffee!).

## Search Indexing Configuration

The website uses [Algolia DocSearch](https://docsearch.algolia.com/) to provide site-wide search functionality. The search configuration is defined in [`.docsearch.config.json`](.docsearch.config.json).

### What is indexed

The configuration ensures that Algolia's crawler indexes:
- All documentation versions (development `next`, latest, and release branches like `4.4.x`)
- Component specifications and tables (fixing issue #1209)
- All heading levels and content blocks
- Code blocks and inline code snippets

### Maintaining the search configuration

If you need to modify what gets indexed or how content is crawled:

1. Edit [`.docsearch.config.json`](.docsearch.config.json) to change selectors or crawling rules
2. Review the detailed documentation in [`.docsearch.README.md`](.docsearch.README.md)
3. Test your changes by building the site locally: `yarn build`
4. Verify content is indexable by visiting the search functionality in the preview

Key elements to be aware of:
- **Selectors** define what HTML elements are indexed (headings, paragraphs, tables, code)
- **start_urls** control which parts of the site are crawled and their search priority
- **selectors_exclude** specify elements to skip (navigation, sidebars, footers)
- **custom_settings** control search behavior and index settings

For more details, see [`.docsearch.README.md`](.docsearch.README.md).

# Checks, publishing the website

The content of the website, as built by the [Camel.website](https://ci-builds.apache.org/job/Camel/job/Camel.website/job/main/)
2 changes: 2 additions & 0 deletions antora-ui-camel/src/css/header.css
@@ -303,6 +303,8 @@ html:not([data-scroll='0']) .navbar {
  margin-right: 10px;
  overflow-y: auto;
  max-height: 80vh;
  max-width: min(600px, 90vw);
  min-width: 300px;
  scrollbar-width: thin; /* Firefox */
}
