6 changes: 5 additions & 1 deletion .vscode/settings.json
@@ -1 +1,5 @@
-{}
+{
+  "rust-analyzer.cargo.extraEnv": {
+    "SQLX_OFFLINE": "false"
+  }
+}
23 changes: 23 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -86,6 +86,7 @@ test-log = { version = "0.2.18", features = [
], default-features = false }
itertools = "0.14.0"
insta = "1.43.2"
pretty_assertions = "1.4.1"

[package.metadata.bin]
just = { version = "1.38.0", locked = true }
2 changes: 1 addition & 1 deletion Justfile
@@ -82,7 +82,7 @@ exec-database-cli: start-database
podman exec -ti -u postgres linkblocks_postgres psql ${DATABASE_NAME}

generate-database-info: start-database migrate-database
-    cargo bin sqlx-cli prepare -- --all-targets
+    SQLX_OFFLINE=false cargo bin sqlx-cli prepare -- --all-targets

start-test-database:
#!/usr/bin/env bash
47 changes: 47 additions & 0 deletions doc/search.md
@@ -0,0 +1,47 @@
# Search

Design document for full-text search across:

- bookmark titles
- content of bookmarked HTML websites
- list titles
- list descriptions (called "content" in the code)

## Requirements

- No extra service to deploy
- Search should take <500ms at the target size given below, and be much faster for small datasets
- Index size should be reasonable, e.g. no more than 2x the size of the original content
- Target: 50 users per instance with 50k bookmarks each should be easy for hosters
- Needs to return matched positions so they can be highlighted in search results
- Should support language-aware stemming
- Should have a "good" way to rank results

## "Good" Ranking

- Should some level of fuzziness be involved?
- Weight matches in the bookmark title higher than website content
- BM25 is a good baseline
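
To make the BM25 baseline concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. The function name `bm25_scores`, the constants `K1` and `B` (set to the commonly used values 1.2 and 0.75), and the token-slice input are assumptions for illustration only; tokenization, stemming and field weighting are left out.

```rust
/// Commonly used BM25 parameters (assumed here, not taken from linkblocks).
const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Scores every document in `docs` against the `query` terms with Okapi BM25.
/// Documents and query are expected to be tokenized already.
fn bm25_scores(docs: &[Vec<&str>], query: &[&str]) -> Vec<f64> {
    if docs.is_empty() {
        return Vec::new();
    }
    let n = docs.len() as f64;
    // Average document length; the guard avoids 0/0 if every document is empty.
    let avg_len =
        (docs.iter().map(Vec::len).sum::<usize>() as f64 / n).max(f64::MIN_POSITIVE);

    // Inverse document frequency per query term.
    // The "+ 1.0" inside the logarithm keeps the value non-negative.
    let idf: Vec<f64> = query
        .iter()
        .map(|term| {
            let n_t = docs.iter().filter(|d| d.contains(term)).count() as f64;
            ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln()
        })
        .collect();

    docs.iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query
                .iter()
                .zip(&idf)
                .map(|(term, idf)| {
                    // Term frequency of this query term in this document.
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}
```

A document that repeats a query term scores higher than an equally long document that mentions it once, and a document without the term scores zero.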

## PostgreSQL full-text search

- The ranking functions `ts_rank` and `ts_rank_cd` seem to have a significant performance impact, although at the scale of linkblocks this might not be a problem
- No BM25 ranking, but it can at least normalize word frequency by document length
- tsvectors can only take a limited number of lexeme positions (but an unlimited number of lexemes?)
- Can rank different parts of the tsvector differently (title, body)
- Quicker to implement than Tantivy
- Let's evaluate how good the ranking actually is
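
Two of the points above, weighting the title differently from the body and normalizing by document length, can be illustrated with a toy scorer. This is a sketch of the idea only, not PostgreSQL's `ts_rank` formula: `weighted_score`, `count_matches` and the token slices are hypothetical, and the weights 1.0 and 0.1 are borrowed from PostgreSQL's default weights for the `A` and `D` labels.

```rust
/// Illustrative per-field weights (assumption: title is labelled A, body D).
const TITLE_WEIGHT: f64 = 1.0;
const BODY_WEIGHT: f64 = 0.1;

/// Counts how many tokens of a field are query terms.
fn count_matches(tokens: &[&str], query: &[&str]) -> usize {
    tokens.iter().filter(|t| query.contains(*t)).count()
}

/// Toy rank: weighted match count divided by the total document length.
fn weighted_score(title: &[&str], body: &[&str], query: &[&str]) -> f64 {
    let raw = TITLE_WEIGHT * count_matches(title, query) as f64
        + BODY_WEIGHT * count_matches(body, query) as f64;
    let doc_len = (title.len() + body.len()) as f64;
    if doc_len == 0.0 {
        0.0
    } else {
        // Dividing by the length keeps long pages from winning just by size.
        raw / doc_len
    }
}
```

With these weights, a single match in the title outranks a single match in the body of an equally long document, which is the behaviour asked for under "Good" Ranking.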

## Tantivy

- Uses BM25 for ranking
- Ranking is faster than with Postgres
- Extra implementation complexity if indexes are cached on disk, or extra memory & compute required if they are stored in-memory
- Check the indexing performance & space requirements
- For on-disk storage, needs extra work to robustly handle file corruption, recovery from bugs, etc.
- Alternatively: store indexes in Postgres

## pg_search

- Implements BM25 and more in Postgres using Tantivy
- [Requires a custom Postgres installation](https://github.com/paradedb/paradedb/tree/main/pg_search#installation)
1 change: 1 addition & 0 deletions src/db/mod.rs
@@ -18,6 +18,7 @@ pub use users::User;
pub mod bookmarks;
pub mod migration_hooks;
pub use bookmarks::Bookmark;
pub mod search;

pub async fn migrate(pool: &PgPool, base_url: &Url, up_to_version: Option<i64>) -> Result<()> {
tracing::info!("Migrating the database...");
106 changes: 106 additions & 0 deletions src/db/search.rs
@@ -0,0 +1,106 @@
use sqlx::{query, query_as};
use uuid::Uuid;

use super::AppTx;
use crate::response_error::ResponseResult;

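/// How to link from the current page of results to the previous one.
pub enum PreviousPage {
    /// There are no results before the current page.
    DoesNotExist,
    /// The previous page is the first page, so no cursor is needed to fetch it.
    IsFirstPage,
    /// The previous page starts after the bookmark with this id.
    AfterBookmarkId(Uuid),
}

/// One page of search results plus the cursors for the neighbouring pages.
pub struct Results {
    pub bookmarks: Vec<Result>,
    pub previous_page: PreviousPage,
    /// Set only if at least one more result follows the current page.
    pub next_page_after_bookmark_id: Option<Uuid>,
}

/// A single bookmark matching the search term.
pub struct Result {
    pub title: String,
    pub bookmark_id: Uuid,
    pub bookmark_url: String,
}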

/// Searches the bookmark titles of `ap_user_id` for `term` (case-insensitive
/// substring match) and returns one page of up to 4 results, ordered by id.
/// Pass `after_bookmark_id` to continue after the last bookmark of a page.
pub async fn search(
    tx: &mut AppTx,
    term: &str,
    ap_user_id: Uuid,
    after_bookmark_id: Option<Uuid>,
) -> ResponseResult<Results> {
    let bookmarks = query_as!(
        Result,
        r#"
        select title, url as bookmark_url, id as bookmark_id
        from bookmarks
        where (bookmarks.title ilike '%' || $1 || '%')
            and bookmarks.ap_user_id = $2
            and ($3::uuid is null or bookmarks.id > $3)
        order by bookmarks.id asc
        limit 4
        "#,
        term,
        ap_user_id,
        after_bookmark_id
    )
    .fetch_all(&mut **tx)
    .await?;

    let last_id = bookmarks.last().map(|b| b.bookmark_id);
    // A next page exists if at least one match follows the last bookmark of
    // this page. One row is enough to know, hence `limit 1`.
    let next_page_after_bookmark_id = query!(
        r#"
        select bookmarks.id
        from bookmarks
        where (bookmarks.title ilike '%' || $1 || '%')
            and bookmarks.ap_user_id = $2
            and ($3::uuid is null or bookmarks.id > $3)
        limit 1
        "#,
        term,
        ap_user_id,
        last_id
    )
    .fetch_optional(&mut **tx)
    .await?
    .and(last_id);

    let first_id = bookmarks.first().map(|b| b.bookmark_id);
    // Check if there are *any* bookmarks before the first of the current page.
    // If so, fetch the ids for the previous page and take the first one.
    // We fetch one row more than the page size (5 instead of 4) because we
    // don't know how small the previous page is.
    let previous_bookmarks = query!(
        r#"
        select bookmarks.id
        from bookmarks
        where (bookmarks.title ilike '%' || $1 || '%')
            and bookmarks.ap_user_id = $2
            and ($3::uuid is null or bookmarks.id < $3)
        order by bookmarks.id desc
        limit 5
        "#,
        term,
        ap_user_id,
        first_id
    )
    .fetch_all(&mut **tx)
    .await?;
    let previous_page = if let Some(last) = previous_bookmarks.last() {
        if previous_bookmarks.len() == 5 {
            // There's another page before the previous page, so we can reference the last
            // bookmark of that page.
            PreviousPage::AfterBookmarkId(last.id)
        } else {
            // The previous page is the first page, so we have no bookmark id to query "after"
            PreviousPage::IsFirstPage
        }
    } else {
        // The query returned 0 results, so there is no previous page.
        PreviousPage::DoesNotExist
    };

    Ok(Results {
        bookmarks,
        previous_page,
        next_page_after_bookmark_id,
    })
}
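
The previous/next page logic of `search` can be modeled in memory, which makes the edge cases easier to reason about. The sketch below is a simplified stand-in, not code from this PR: `paginate`, `Page`, `PAGE_SIZE` and the `u32` ids are hypothetical, and `sorted_ids` plays the role of the matching rows ordered by id. It mirrors the three queries, including the behaviour that a missing cursor (`None`) disables the corresponding filter.

```rust
const PAGE_SIZE: usize = 4;

#[derive(Debug, PartialEq)]
enum PreviousPage {
    DoesNotExist,
    IsFirstPage,
    AfterId(u32),
}

#[derive(Debug, PartialEq)]
struct Page {
    ids: Vec<u32>,
    previous_page: PreviousPage,
    next_page_after: Option<u32>,
}

/// `sorted_ids` stands in for all matching rows, ordered by id ascending.
fn paginate(sorted_ids: &[u32], after: Option<u32>) -> Page {
    // Current page: the first PAGE_SIZE ids strictly greater than the cursor.
    let ids: Vec<u32> = sorted_ids
        .iter()
        .copied()
        .filter(|id| after.map_or(true, |a| *id > a))
        .take(PAGE_SIZE)
        .collect();

    // A next page exists if any id follows the last id of this page.
    let next_page_after = ids
        .last()
        .copied()
        .filter(|last| sorted_ids.iter().any(|id| id > last));

    // Look back PAGE_SIZE + 1 ids before the first id of this page.
    let first = ids.first().copied();
    let before: Vec<u32> = sorted_ids
        .iter()
        .rev()
        .copied()
        .filter(|id| first.map_or(true, |f| *id < f))
        .take(PAGE_SIZE + 1)
        .collect();

    let previous_page = match before.len() {
        0 => PreviousPage::DoesNotExist,
        // The extra id is the last id of the page before the previous page.
        n if n == PAGE_SIZE + 1 => PreviousPage::AfterId(before[PAGE_SIZE]),
        _ => PreviousPage::IsFirstPage,
    };

    Page { ids, previous_page, next_page_after }
}
```

For ten matching ids, calling `paginate` with `None`, then with each returned `next_page_after`, walks through three pages; on the third page the previous-page cursor points at the last id of the first page.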
1 change: 1 addition & 0 deletions src/routes/mod.rs
@@ -5,3 +5,4 @@ pub mod index;
pub mod links;
pub mod lists;
pub mod users;
pub mod search;
45 changes: 45 additions & 0 deletions src/routes/search.rs
@@ -0,0 +1,45 @@
use axum::{Router, extract::Query, routing::get};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

use crate::{
    authentication::AuthUser,
    db,
    extract,
    htmf_response::HtmfResponse,
    response_error::ResponseResult,
    server::AppState,
    views::{self, layout},
};

pub fn router() -> Router<AppState> {
    let router: Router<AppState> = Router::new();
    router.route("/search", get(get_search))
}

#[derive(Deserialize, Serialize)]
pub struct SearchQuery {
    /// The words to search for
    pub q: String,
    /// Pagination cursor: only return bookmarks after this id
    pub after_bookmark_id: Option<Uuid>,
}

async fn get_search(
    auth_user: AuthUser,
    extract::Tx(mut tx): extract::Tx,
    Query(query): Query<SearchQuery>,
) -> ResponseResult<HtmfResponse> {
    let results = db::search::search(
        &mut tx,
        &query.q,
        auth_user.ap_user_id,
        query.after_bookmark_id,
    )
    .await?;
    let mut layout = layout::Template::from_db(&mut tx, Some(&auth_user)).await?;
    layout.previous_search_input = Some(query.q);
    Ok(HtmfResponse(views::search_results::view(
        &views::search_results::Data { layout, results },
    )))
}
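
Because `q` ends up inside an `ilike '%' || $1 || '%'` pattern in `db::search::search`, any `%` or `_` typed by the user acts as a wildcard. If literal matching is wanted, the term could be escaped before it is bound. The helper below is a hypothetical sketch that is not part of this PR; it assumes PostgreSQL's default `LIKE` escape character, the backslash.

```rust
/// Escapes the characters that are special in a `LIKE`/`ILIKE` pattern
/// (`\`, `%` and `_`) by prefixing each with a backslash.
fn escape_like(term: &str) -> String {
    let mut escaped = String::with_capacity(term.len());
    for c in term.chars() {
        if matches!(c, '\\' | '%' | '_') {
            escaped.push('\\');
        }
        escaped.push(c);
    }
    escaped
}
```

Under that assumption, the handler would pass `&escape_like(&query.q)` to the database function instead of `&query.q`, while still showing the unescaped input in `previous_search_input`.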
1 change: 1 addition & 0 deletions src/server.rs
@@ -59,6 +59,7 @@ pub async fn app(state: AppState) -> anyhow::Result<Router> {
.merge(routes::bookmarks::router())
.merge(routes::links::router())
.merge(routes::federation::router())
.merge(routes::search::router())
.merge(routes::assets::router().with_state(()))
// TODO add layer to use the same URL for AP and HTML
// this should simplify things and be more error tolerant for other services
1 change: 1 addition & 0 deletions src/tests/mod.rs
@@ -7,5 +7,6 @@ mod federation;
mod index;
mod lists;
mod migrations;
mod search;
mod users;
mod util;