Conversation

@mrmps
Owner

@mrmps mrmps commented Dec 7, 2025

Summary by CodeRabbit

  • New Features

    • Added automatic right-to-left (RTL) language support with intelligent text direction detection for articles and AI responses.
  • Bug Fixes

    • Improved URL validation and normalization to correctly handle malformed protocol formats.

@vercel

vercel bot commented Dec 7, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
smry Error Error Dec 7, 2025 10:20am

@coderabbitai

coderabbitai bot commented Dec 7, 2025

Walkthrough

This pull request introduces RTL (Right-to-Left) language support across the application by adding language detection utilities, propagating lang and dir metadata through API routes and components, and updating stylesheets for RTL layout. It also migrates the build system from Node/pnpm to Bun, enhances URL validation to handle malformed protocols, and removes a marketing feature.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Build System Migration to Bun: Dockerfile, package.json, tsconfig.json | Replaces Node.js/pnpm with Bun in Dockerfile (base image, install, build command). Adds bun-types to dependencies and updates TypeScript compiler options to include Bun type definitions. |
| RTL Language Support Utilities: lib/rtl.ts, lib/rtl.test.ts | Introduces RTL detection module with functions to identify RTL languages, analyze text direction via Unicode ranges, and generate direction attributes. Includes comprehensive test coverage for RTL/LTR detection across multiple languages and mixed-content scenarios. |
| RTL-Aware Styling: app/globals.css | Adds base-layer CSS rules for RTL support, including direction-based text alignment, prose styling adjustments, blockquote/list/code block adaptations, and forced LTR for code within RTL contexts. |
| Article API Language & Direction Metadata: app/api/article/route.ts, app/api/jina/route.ts | Propagates lang and dir fields through article fetch/cache/render paths. Computes htmlLang from HTML/parsed data, derives textDir via getTextDirection utility, and includes language/direction in API responses and cached article objects. |
| Component RTL/Language Props: components/ai/response.tsx, components/article/content.tsx, components/features/summary-form.tsx | Adds dir and lang props to Response component; propagates language/direction attributes to rendered article header and content elements; passes RTL metadata to AI response rendering. |
| Type Schema Updates: types/api.ts | Constrains dir field in ArticleSchema to enum values ('rtl' \| 'ltr') with 'ltr' default, replacing unrestricted string type. |
| URL Validation Enhancement: lib/validation/url.ts, lib/validation/url.test.ts | Introduces cleanProtocol helper to sanitize malformed/duplicate protocols; updates normalizeUrl to collapse and fix protocol issues before validation; adds comprehensive test suite covering edge cases and real-world malformed URLs. |
| Marketing UI Cleanup: components/marketing/ad-spot.tsx | Removes Zap icon import and corresponding "Fair rotation" FeatureCard from AdDrawerContent. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Areas requiring attention:
    • RTL detection algorithm correctness in lib/rtl.ts, particularly Unicode range classification and character-counting logic in detectTextDirection
    • Schema/type consistency across all touched files (verify lang/dir fields propagate correctly through article, cached, and API response types)
    • Component prop propagation chain for RTL attributes: confirm that fallbacks are correct and that no attribute is dropped along the way
    • URL protocol cleaning loop in lib/validation/url.ts for edge cases (e.g., deeply nested or malformed protocols)
    • Dockerfile build correctness and feature parity with pnpm setup (cache mounting, lockfile handling)

Possibly related PRs

Poem

🐰 Hop, hop, with texts that twist and turn,
RTL rules and directions we learn!
Bun builds swift, from left to right,
Metadata flows—each article bright! 🌍✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title 'Feat/add clerk auth and premium ux' does not match the actual changes, which primarily implement RTL/LTR language support, Bun migration, and URL validation improvements with no Clerk authentication or premium UX changes present. | Update the title to accurately reflect the main changes, such as 'Add RTL/LTR language support and Bun migration' or similar. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 78.57% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/add-clerk-auth-and-premium-ux

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps bot commented Dec 7, 2025

Greptile Overview

Greptile Summary

This PR adds comprehensive RTL (Right-to-Left) language support and migrates the project from Node.js/pnpm to the Bun runtime.

Key Changes

  • RTL Language Support: Implemented automatic detection and proper rendering of RTL languages (Arabic, Hebrew, Persian, etc.) by extracting language metadata from HTML, analyzing Unicode character ranges, and applying appropriate CSS styling
  • URL Normalization Enhancement: Fixed handling of malformed URL protocols including duplicate protocols (https://https://) and single-slash malformations (https:/example.com)
  • Bun Migration: Migrated build system from Node.js/pnpm to Bun for improved performance, including Docker configuration and test framework updates
  • Type Safety: Added proper TypeScript types for dir and lang fields throughout the API schema and component props

Implementation Quality

The RTL implementation is well-architected with proper separation of concerns. The lib/rtl.ts utility provides language detection via ISO 639-1 codes and Unicode range analysis, with a smart fallback that analyzes text content when language metadata is unavailable. Test coverage is comprehensive with real-world Arabic, Hebrew, and Persian text samples.
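
The detection flow described above (language-code lookup first, Unicode range analysis as the fallback) could look roughly like the sketch below. This is not the code from lib/rtl.ts: the language subset, the Unicode ranges, the Latin-only LTR count, and the "RTL letters outnumber Latin letters" rule are all assumptions made for illustration. Only the function names and the 10,000-character sample come from the review.

```typescript
// Hypothetical sketch; the real lib/rtl.ts may differ in names, ranges, and thresholds.
type TextDirection = "rtl" | "ltr";

// Assumed subset of RTL language codes (ISO 639-1).
const RTL_LANGUAGES = new Set<string>(["ar", "he", "fa", "ur"]);

// Assumed Unicode ranges for RTL scripts (all inside the Basic Multilingual Plane).
const RTL_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x0590, 0x05ff], // Hebrew
  [0x0600, 0x06ff], // Arabic
  [0x0750, 0x077f], // Arabic Supplement
  [0xfb1d, 0xfdff], // Hebrew and Arabic presentation forms
  [0xfe70, 0xfeff], // Arabic Presentation Forms-B
];

function isRTLLanguage(lang: string | null | undefined): boolean {
  if (!lang) return false;
  // Reduce locales such as "ar-SA" or "ar_SA" to the primary subtag "ar".
  const primary = lang.toLowerCase().split(/[-_]/)[0] ?? "";
  return RTL_LANGUAGES.has(primary);
}

function isRTLChar(code: number): boolean {
  return RTL_RANGES.some(([start, end]) => code >= start && code <= end);
}

function isLatinLetter(code: number): boolean {
  return (code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a);
}

function detectTextDirection(text: string | null | undefined): TextDirection {
  if (!text) return "ltr";
  const sample = text.slice(0, 10000); // sample size mentioned in the review
  let rtl = 0;
  let ltr = 0;
  for (let i = 0; i < sample.length; i++) {
    const code = sample.charCodeAt(i);
    if (isRTLChar(code)) rtl++;
    else if (isLatinLetter(code)) ltr++;
  }
  // Assumed rule: RTL wins only when RTL letters outnumber Latin letters.
  return rtl > ltr ? "rtl" : "ltr";
}

function getTextDirection(
  lang: string | null | undefined,
  text: string | null | undefined
): TextDirection {
  // An explicit RTL language code takes priority; otherwise analyze the content.
  return isRTLLanguage(lang) ? "rtl" : detectTextDirection(text);
}
```

Counting characters rather than checking only the first strong character keeps a short Latin headline from flipping an otherwise Arabic article to LTR, at the cost of a linear scan over the sample.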

The URL normalization fix addresses a production issue by cleaning protocols iteratively, which handles edge cases such as multiple duplicate protocols.
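
As an illustration of iterative protocol cleaning, here is a minimal sketch. The function name cleanProtocol comes from the PR, but the body below is an assumption and not the code in lib/validation/url.ts; the two regular expressions are hypothetical.

```typescript
// Hypothetical sketch; the actual cleanProtocol in lib/validation/url.ts may differ.

// Matches a leading http(s) protocol that is immediately followed by another one.
const DUPLICATE_PROTOCOL = /^https?:\/*(?=https?:\/)/i;
// Matches a leading http(s) protocol with one or more slashes.
const LEADING_PROTOCOL = /^(https?):\/+/i;

function cleanProtocol(input: string): string {
  let url = input.trim();
  // Strip duplicated leading protocols one at a time until none remain.
  // Each pass removes at least "http:", so the loop always terminates.
  while (DUPLICATE_PROTOCOL.test(url)) {
    url = url.replace(DUPLICATE_PROTOCOL, "");
  }
  // Repair the slash count and lowercase the scheme: "https:/x" -> "https://x".
  return url.replace(
    LEADING_PROTOCOL,
    (_match: string, scheme: string) => `${scheme.toLowerCase()}://`
  );
}
```

Inputs without a protocol pass through unchanged, so a later step can still decide whether to prepend https://.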

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are well-tested, follow established patterns, and add new functionality without breaking existing features. The RTL detection logic is robust with proper fallbacks, URL normalization fixes real bugs, and the Bun migration is a straightforward runtime swap with no logic changes. All new code has corresponding tests.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| Dockerfile | 5/5 | Migrated from Node.js/pnpm to Bun runtime with updated build commands |
| lib/rtl.ts | 5/5 | New RTL language detection utilities with Unicode range analysis and language code lookup |
| lib/validation/url.ts | 5/5 | Added cleanProtocol function to fix duplicate and malformed URL protocols |
| app/api/article/route.ts | 4/5 | Integrated RTL detection, extracts language from HTML attributes, adds dir/lang to cached articles |
| app/api/jina/route.ts | 4/5 | Added RTL text direction detection for Jina-sourced articles with fallback handling |
| components/article/content.tsx | 5/5 | Applied dir and lang attributes to article header and content containers |
| types/api.ts | 5/5 | Updated ArticleSchema to include dir field with 'rtl' \| 'ltr' |

Sequence Diagram

sequenceDiagram
    participant User
    participant Frontend
    participant ArticleAPI as /api/article
    participant JinaAPI as /api/jina
    participant RTLLib as lib/rtl
    participant URLValidation as lib/validation/url
    participant Redis
    participant Diffbot
    participant JinaReader as Jina Reader

    User->>Frontend: Enter URL
    Frontend->>URLValidation: normalizeUrl(url)
    URLValidation->>URLValidation: cleanProtocol()
    URLValidation-->>Frontend: Normalized URL

    alt Article Fetch (smry-fast/slow/wayback)
        Frontend->>ArticleAPI: GET /api/article?url=...&source=...
        ArticleAPI->>Redis: Check cache
        
        alt Cache Hit
            Redis-->>ArticleAPI: Cached article
        else Cache Miss
            ArticleAPI->>Diffbot: Fetch article
            Diffbot-->>ArticleAPI: HTML content
            ArticleAPI->>ArticleAPI: Parse with Readability
            ArticleAPI->>ArticleAPI: Extract lang attribute
        end
        
        ArticleAPI->>RTLLib: getTextDirection(lang, textContent)
        RTLLib->>RTLLib: isRTLLanguage(lang)
        alt Language is RTL
            RTLLib-->>ArticleAPI: 'rtl'
        else Analyze content
            RTLLib->>RTLLib: detectTextDirection(textContent)
            RTLLib-->>ArticleAPI: 'rtl' or 'ltr'
        end
        
        ArticleAPI->>Redis: Cache article with dir/lang
        ArticleAPI-->>Frontend: Article with dir/lang
    else Jina Fetch
        Frontend->>JinaAPI: GET /api/jina?url=...
        JinaAPI->>Redis: Check cache
        
        alt Cache Hit
            Redis-->>JinaAPI: Cached article
        else Cache Miss
            JinaAPI->>JinaReader: Fetch from Jina
            JinaReader-->>JinaAPI: Article content
        end
        
        JinaAPI->>RTLLib: getTextDirection(null, textContent)
        RTLLib->>RTLLib: detectTextDirection(textContent)
        RTLLib-->>JinaAPI: 'rtl' or 'ltr'
        
        JinaAPI->>Redis: Cache article with dir
        JinaAPI-->>Frontend: Article with dir
    end

    Frontend->>Frontend: Render with dir/lang attributes
    Frontend->>User: Display article (RTL or LTR)

@greptile-apps greptile-apps bot left a comment

15 files reviewed, no comments

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (7)
tsconfig.json (1)

2-4: Global Bun typings are fine, but consider scoping to tests if needed

Adding "types": ["bun-types"] ensures Bun globals and bun:test are typed across the project. If you ever run into conflicts with Node/Next types, consider moving Bun typings into a dedicated test/build tsconfig instead of the main one.

Dockerfile (1)

1-11: Bun-based build looks good; consider pinning image version and/or multi-stage build

The Bun migration (cached bun install + bun run build + bun run start) is clear and should work fine. For better reproducibility and smaller images, consider:

  • Pinning oven/bun to a specific version instead of latest.
  • Optionally using a multi-stage build (build in one stage, run in a slimmer runtime stage) if image size becomes a concern.
app/globals.css (1)

179-221: RTL base styles are well-scoped and align with expected behavior

The new [dir="rtl"] rules (text alignment, blockquotes, lists, and forcing code/pre back to LTR) are appropriately scoped in the base layer and should play nicely with Tailwind utilities. This is a solid foundation for RTL rendering.

lib/rtl.test.ts (1)

1-108: Strong RTL test coverage; minor describe/test mismatch for Persian case

The test suite thoroughly exercises isRTLLanguage, detectTextDirection, and getTextDirection for RTL/LTR, mixed content, and null/empty inputs—nice coverage.

One small nit: inside describe('getTextDirection'), the "handles Persian text" test calls detectTextDirection directly. For clarity, consider either:

  • Moving that test into the detectTextDirection describe block, or
  • Changing it to assert getTextDirection('fa', persianText) so it matches the surrounding describe name.

Functionally everything is correct; this is just a structure/readability tweak.

lib/validation/url.test.ts (1)

4-195: URL tests are thorough; consider adding coverage for uppercase HTTP(S) schemes

This suite does a great job exercising normalization (including malformed and duplicate protocols), validation, and the Zod schema across many realistic cases.

To backstop the case-insensitivity fix suggested in lib/validation/url.ts, it would be useful to add a couple of expectations like:

it("treats uppercase protocols as valid", () => {
  expect(normalizeUrl("HTTPS://example.com")).toBe("https://example.com");
  expect(normalizeUrl("HTTP://example.com")).toBe("http://example.com");
  expect(isValidUrl("HTTPS://example.com")).toBe(true);
});

That will ensure regressions around scheme casing are caught by tests.

Also applies to: 197-246

lib/rtl.ts (1)

72-74: Sampling strategy note.

Sampling only the first 10,000 characters may not be representative for articles where RTL content appears predominantly in the body rather than the lead. Consider whether this is acceptable for your use case, or if random/distributed sampling would be more robust.
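
One way to read "distributed sampling" is to take evenly spaced slices across the whole text instead of a single prefix. The helper below is a hypothetical sketch (sampleDistributed and its parameters are not part of the PR); its output could be passed to the direction detector in place of the first 10,000 characters.

```typescript
// Hypothetical helper: returns `chunks` evenly spaced slices of `text`
// whose combined length is at most `budget` characters.
function sampleDistributed(text: string, budget = 10000, chunks = 10): string {
  if (text.length <= budget) return text;
  // Degenerate settings fall back to a plain prefix sample.
  if (chunks < 2 || budget < chunks) return text.slice(0, budget);

  const chunkSize = Math.floor(budget / chunks);
  // Distance between chunk starts so the last chunk ends at the end of the text.
  const stride = (text.length - chunkSize) / (chunks - 1);

  const parts: string[] = [];
  for (let i = 0; i < chunks; i++) {
    const start = Math.floor(i * stride);
    parts.push(text.slice(start, start + chunkSize));
  }
  return parts.join("");
}
```

The trade-off is that slices may start mid-word, which does not matter for character counting but would matter if the sample were used for anything language-aware.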

app/api/jina/route.ts (1)

173-183: Redundant dir assignment and inconsistent lang handling.

On lines 177-180:

  1. ...articleWithDir already includes dir: articleDir, then line 179 sets dir: articleDir again (redundant).
  2. Line 180 hardcodes lang: "" which overwrites the lang from articleWithDir.

Consider cleaning up:

          article: {
            ...articleWithDir,
            byline: article.byline || "",
-           dir: articleDir,
-           lang: "",
+           lang: articleWithDir.lang || "",
            publishedTime: article.publishedTime || null,
            htmlContent: article.content,
          },
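
The overwrite described in this comment follows from object spread semantics: keys written after the spread win. A minimal sketch, using a hypothetical articleWithDir value rather than the real route data:

```typescript
// Hypothetical data standing in for the object built in the route handler.
const articleWithDir: { title: string; dir: "rtl" | "ltr"; lang: string | null } = {
  title: "Example",
  dir: "rtl",
  lang: "ar",
};

// Keys listed after the spread override the spread values:
// `dir` is re-assigned to the same value (redundant), `lang` is lost.
const overwritten = { ...articleWithDir, dir: articleWithDir.dir, lang: "" };

// Dropping the redundant `dir` and deriving `lang` from the spread source keeps both.
const preserved = { ...articleWithDir, lang: articleWithDir.lang || "" };
```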
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d68c712 and d6ade63.

⛔ Files ignored due to path filters (2)
  • bun.lock is excluded by !**/*.lock
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (15)
  • Dockerfile (1 hunks)
  • app/api/article/route.ts (11 hunks)
  • app/api/jina/route.ts (6 hunks)
  • app/globals.css (1 hunks)
  • components/ai/response.tsx (4 hunks)
  • components/article/content.tsx (2 hunks)
  • components/features/summary-form.tsx (1 hunks)
  • components/marketing/ad-spot.tsx (1 hunks)
  • lib/rtl.test.ts (1 hunks)
  • lib/rtl.ts (1 hunks)
  • lib/validation/url.test.ts (1 hunks)
  • lib/validation/url.ts (3 hunks)
  • package.json (1 hunks)
  • tsconfig.json (1 hunks)
  • types/api.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
lib/rtl.test.ts (1)
lib/rtl.ts (3)
  • isRTLLanguage (45-51)
  • detectTextDirection (66-104)
  • getTextDirection (110-121)
components/features/summary-form.tsx (1)
components/ai/response.tsx (1)
  • Response (377-420)
lib/validation/url.test.ts (1)
lib/validation/url.ts (3)
  • normalizeUrl (49-69)
  • isValidUrl (74-81)
  • NormalizedUrlSchema (87-104)
app/api/jina/route.ts (2)
lib/rtl.ts (1)
  • getTextDirection (110-121)
types/api.ts (1)
  • ArticleResponseSchema (49-56)
app/api/article/route.ts (1)
lib/rtl.ts (1)
  • getTextDirection (110-121)
🪛 ast-grep (0.40.0)
components/article/content.tsx

[warning] 300-300: Usage of dangerouslySetInnerHTML detected. This bypasses React's built-in XSS protection. Always sanitize HTML content using libraries like DOMPurify before injecting it into the DOM to prevent XSS attacks.
Context: dangerouslySetInnerHTML
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation [REFERENCES]
- https://reactjs.org/docs/dom-elements.html#dangerouslysetinnerhtml
- https://cwe.mitre.org/data/definitions/79.html

(react-unsafe-html-injection)

🔇 Additional comments (24)
package.json (1)

88-107: bun-types devDependency wiring looks appropriate

Adding bun-types as a devDependency aligns with using Bun tooling and bun:test while keeping runtime dependencies clean. No issues from a build/tooling perspective.

components/marketing/ad-spot.tsx (1)

3-6: Lucide icon imports are now minimal and consistent with usage

The lucide-react import list matches actual icon usage (no unused icons like Zap), keeping the bundle clean.

lib/rtl.ts (5)

6-22: RTL languages set looks comprehensive.

The set covers major RTL languages. Note that Kurdish (ku) and Hausa (ha) are each written in more than one script: both can appear in Arabic script (RTL) or Latin script (LTR). The current approach of treating them as RTL by default is reasonable for content analysis but may produce false positives for Latin-script Kurdish/Hausa content.


45-51: LGTM!

The locale normalization correctly handles both hyphen and underscore separators (e.g., ar-SA, ar_SA), and the null-check is appropriate.


56-60: LGTM!

The implementation is correct and efficient for the small number of RTL ranges.


110-121: LGTM!

The function correctly prioritizes explicit RTL language codes while falling back to content analysis. The behavior of still analyzing content when a non-RTL language code is provided (e.g., en) is reasonable as it allows detection of mixed-content scenarios.


126-136: LGTM!

Clean implementation with proper conditional property inclusion.

components/features/summary-form.tsx (1)

208-211: LGTM!

The RTL props are correctly propagated to the Response component, with appropriate fallback values that align with the article schema defaults.

types/api.ts (1)

29-29: Verify intended behavior with .nullable().optional().default('ltr').

With this chain, .default('ltr') only applies when the field is undefined or missing. If dir: null is explicitly passed, it will remain null (not default to 'ltr').

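If null should also resolve to 'ltr', note that .nullish() is shorthand for .nullable().optional(), so .nullish().default('ltr') would behave the same way and still let null through. Map null explicitly instead, for example with .nullish().transform((v) => v ?? 'ltr'), or apply the default after parsing. Otherwise, components consuming this schema should handle null values explicitly (as they currently do with dir || 'ltr' fallbacks).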
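
The difference between a default that fires only for undefined and one that also covers null can be shown without any schema library. Both helpers below are hypothetical models of the two behaviors, not code from types/api.ts.

```typescript
type TextDirection = "rtl" | "ltr";

// Models a default that applies only when the value is undefined or missing;
// an explicit null passes through untouched.
function defaultOnUndefined(
  dir: TextDirection | null | undefined
): TextDirection | null {
  return dir === undefined ? "ltr" : dir;
}

// Models applying the default after parsing: null and undefined both become "ltr".
function defaultOnNullish(dir: TextDirection | null | undefined): TextDirection {
  return dir ?? "ltr";
}
```

With the first behavior, consumers still need their own dir || 'ltr' fallback; with the second, the fallback lives in one place.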

components/article/content.tsx (1)

70-74: LGTM!

The header container correctly propagates article direction and language attributes for proper RTL rendering. The fallbacks align with the schema defaults.

app/api/jina/route.ts (4)

22-24: LGTM!

The CachedArticleSchema correctly adds optional lang and dir fields with proper types matching the ArticleSchema constraints.


71-72: LGTM!

The GET path correctly falls back to computing direction when not cached, using both the language code and text content for accurate detection.


197-198: LGTM!

The existing-cache path correctly computes direction as a fallback, consistent with the GET handler.


210-220: LGTM!

The error fallback path correctly computes and includes direction, ensuring RTL support even when caching fails.

components/ai/response.tsx (2)

203-206: LGTM!

The type definitions for RTL support are well-documented and correctly typed. The optional nature is appropriate since language metadata may not always be available.


386-402: LGTM!

The dir and lang props are correctly propagated to the root container div, enabling proper RTL rendering and language accessibility for the markdown content.

app/api/article/route.ts (8)

12-12: LGTM!

Import of getTextDirection aligns with the RTL support pattern used across the codebase.


26-27: LGTM!

Schema addition for lang is appropriately optional and nullable for backward compatibility with existing cached data.


40-41: LGTM!

The dir field is correctly constrained to the enum ['rtl', 'ltr'] and both fields are optional/nullable, ensuring backward compatibility with existing cache entries.


223-250: LGTM!

The language extraction logic properly checks multiple sources (HTML lang, xml:lang, Readability's extracted lang) with appropriate fallbacks. The text direction detection correctly uses both the language code and content analysis via getTextDirection.
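
The source priority this comment describes (HTML lang, then xml:lang, then the language Readability extracted) might be sketched as follows. extractHtmlLang is a hypothetical helper, and it uses regular expressions only to stay dependency-free; the route itself may well read these attributes from a parsed document instead.

```typescript
// Hypothetical helper; not the implementation in app/api/article/route.ts.
function extractHtmlLang(
  html: string,
  readabilityLang?: string | null
): string | null {
  // Look only at the opening <html ...> tag.
  const htmlTag = html.match(/<html\b[^>]*>/i);
  if (htmlTag) {
    const tag = htmlTag[0];
    // 1. Plain lang attribute (whitespace before "lang" rules out "xml:lang").
    const lang = tag.match(/\slang\s*=\s*["']?([A-Za-z0-9_-]+)/i)?.[1];
    if (lang) return lang;
    // 2. xml:lang attribute.
    const xmlLang = tag.match(/\sxml:lang\s*=\s*["']?([A-Za-z0-9_-]+)/i)?.[1];
    if (xmlLang) return xmlLang;
  }
  // 3. Fall back to whatever the article parser reported, if anything.
  return readabilityLang || null;
}
```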


332-349: LGTM!

Direction detection and field population follow the same consistent pattern as the smry-fast path. The enhanced logging with lang and dir improves observability.


455-456: LGTM!

The fallback to getTextDirection for missing dir values ensures backward compatibility with cached articles that predate this change.


539-540: LGTM!

Consistent dir/lang population pattern maintained across the fallback response path.


562-563: LGTM!

Consistent dir/lang population in the main success response path.

Comment on lines +610 to +617
dir: validatedArticle.dir || getTextDirection(validatedArticle.lang, validatedArticle.textContent),
lang: validatedArticle.lang || "",
content: validatedArticle.content,
textContent: validatedArticle.textContent,
length: validatedArticle.length,
siteName: validatedArticle.siteName,
publishedTime: validatedArticle.publishedTime || null,
image: validatedArticle.image || null,

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for ArticleResponseSchema definition
rg -n "ArticleResponseSchema" --type=ts -A 40

Repository: mrmps/SMRY

Length of output: 29677


🏁 Script executed:

#!/bin/bash
# Find ArticleSchema definition
rg -n "ArticleSchema\s*=" --type=ts -B 2 -A 20 | head -100

Repository: mrmps/SMRY

Length of output: 5544


Response object inconsistency: image field presence varies across paths.

The response at line 617 includes image: validatedArticle.image || null, but similar response constructions at lines 533-549 and 556-572 omit this field. While ArticleSchema defines image as optional, this inconsistency means clients will receive different response structures depending on which code path executes (cache hit, post-fetch validation, or error handler).

Add the image field to the response blocks at lines 533-549 and 556-572 for consistency.

🤖 Prompt for AI Agents
In app/api/article/route.ts around lines 533-549, 556-572 and 610-617, the
responses are inconsistent: the block at 610-617 includes image:
validatedArticle.image || null but the earlier response blocks (533-549 and
556-572) omit image; add image: validatedArticle.image || null to the response
objects in the 533-549 and 556-572 blocks so all code paths return the same
shape (use the same null fallback as the existing block) and keep field ordering
consistent with the other response properties.
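
One way to keep every path on the same response shape is to build the object in a single helper. The sketch below is an assumption, not code from the PR: toArticleResponse, the ValidatedArticle interface, and the injected detectDir callback (standing in for getTextDirection) are hypothetical, and only the fields quoted in this comment are included.

```typescript
type TextDirection = "rtl" | "ltr";

// Assumed shape, limited to the fields quoted in the review comment.
interface ValidatedArticle {
  content: string;
  textContent: string;
  length: number;
  siteName: string;
  dir?: TextDirection | null;
  lang?: string | null;
  publishedTime?: string | null;
  image?: string | null;
}

// Single place that decides the response shape, so cache-hit, post-fetch,
// and error-handler paths cannot drift apart.
function toArticleResponse(
  article: ValidatedArticle,
  detectDir: (lang: string | null, text: string) => TextDirection
) {
  return {
    dir: article.dir || detectDir(article.lang ?? null, article.textContent),
    lang: article.lang || "",
    content: article.content,
    textContent: article.textContent,
    length: article.length,
    siteName: article.siteName,
    publishedTime: article.publishedTime || null,
    image: article.image || null,
  };
}
```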

Comment on lines +164 to +166
// Detect text direction for the incoming article
const articleDir = getTextDirection(null, article.textContent);
const articleWithDir = { ...article, dir: articleDir, lang: article.lang || null };

⚠️ Potential issue | 🟡 Minor

article.lang doesn't exist on the input schema.

The JinaCacheUpdateSchema (defined in types/api.ts, lines 92-104) doesn't include a lang field. The expression article.lang || null on line 166 will always evaluate to null because article.lang is undefined.

Either add lang to JinaCacheUpdateSchema if callers can provide it, or simplify:

-      const articleWithDir = { ...article, dir: articleDir, lang: article.lang || null };
+      const articleWithDir = { ...article, dir: articleDir, lang: null };
#!/bin/bash
# Verify JinaCacheUpdateSchema definition
echo "=== JinaCacheUpdateSchema definition ==="
rg -n -A15 'JinaCacheUpdateSchema' types/api.ts
🤖 Prompt for AI Agents
In app/api/jina/route.ts around lines 164 to 166, the code reads article.lang ||
null but JinaCacheUpdateSchema does not define a lang field so article.lang will
always be undefined; either (A) add an optional lang?: string | null to
JinaCacheUpdateSchema in types/api.ts (update the type at lines ~92-104 and
ensure any callers populate it where appropriate), or (B) simplify the route
code to stop referencing article.lang and set lang: null (or omit lang entirely)
when constructing articleWithDir; pick A if callers can/should supply language,
otherwise pick B and remove the redundant check.

);
},
-  (prevProps, nextProps) => prevProps.children === nextProps.children
+  (prevProps, nextProps) => prevProps.children === nextProps.children && prevProps.dir === nextProps.dir

⚠️ Potential issue | 🟡 Minor

Memoization predicate missing lang comparison.

The lang prop is passed to the div but not included in the memoization check. If lang changes while children and dir remain the same, the component won't re-render, leaving a stale lang attribute on the DOM element.

-  (prevProps, nextProps) => prevProps.children === nextProps.children && prevProps.dir === nextProps.dir
+  (prevProps, nextProps) => prevProps.children === nextProps.children && prevProps.dir === nextProps.dir && prevProps.lang === nextProps.lang
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  (prevProps, nextProps) => prevProps.children === nextProps.children && prevProps.dir === nextProps.dir
+  (prevProps, nextProps) => prevProps.children === nextProps.children && prevProps.dir === nextProps.dir && prevProps.lang === nextProps.lang
🤖 Prompt for AI Agents
In components/ai/response.tsx around line 419, the memoization predicate only
compares children and dir but omits lang, so when lang changes the component can
stay memoized and the DOM's lang attribute becomes stale; update the predicate
to also compare prevProps.lang === nextProps.lang (i.e., return
prevProps.children === nextProps.children && prevProps.dir === nextProps.dir &&
prevProps.lang === nextProps.lang) so the component re-renders when lang
changes.
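
Pulling the predicate into a named function makes the stale-attribute case easy to check in isolation. The sketch below is illustrative: ResponseProps is a simplified, assumed prop shape, and areResponsePropsEqual is a hypothetical name for the function that would be passed as the second argument to React.memo.

```typescript
// Simplified, assumed prop shape; the real Response props include more fields.
interface ResponseProps {
  children?: unknown;
  dir?: "rtl" | "ltr";
  lang?: string;
}

// Returns true when a re-render can be skipped. Every prop that ends up
// in the DOM (children, dir, lang) has to take part in the comparison.
function areResponsePropsEqual(prev: ResponseProps, next: ResponseProps): boolean {
  return (
    prev.children === next.children &&
    prev.dir === next.dir &&
    prev.lang === next.lang
  );
}
```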

Comment on lines 297 to 302
<div
className="mt-6 wrap-break-word prose dark:prose-invert max-w-none"
dir={data.article.dir || 'ltr'}
lang={data.article.lang || undefined}
dangerouslySetInnerHTML={{ __html: data.article.content }}
/>

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for sanitization utilities and DOMPurify usage
echo "=== Checking for DOMPurify and sanitization imports ==="
rg -n 'DOMPurify|sanitize|purify' --type=ts --type=tsx

echo ""
echo "=== Checking article-related files ==="
fd -e ts -e tsx 'article' | head -20

echo ""
echo "=== Checking for readability processing ==="
rg -n 'readability|Readability' --type=ts --type=tsx

Repository: mrmps/SMRY

Length of output: 293


🏁 Script executed:

#!/bin/bash
# Search without file type filter
echo "=== Checking for sanitization ==="
rg -n 'DOMPurify|sanitize|purify|xss|escapeHtml|strip' --glob='*.ts' --glob='*.tsx'

echo ""
echo "=== Reading use-articles.ts ==="
cat -n lib/hooks/use-articles.ts

echo ""
echo "=== Finding article API routes ==="
find . -type f \( -name '*.ts' -o -name '*.tsx' \) -path '*/routes/*' -o -path '*/api/*' | grep -i article

Repository: mrmps/SMRY

Length of output: 5161


🏁 Script executed:

#!/bin/bash
# Read jina.ts to see escapeHtml usage
echo "=== Reading lib/api/jina.ts ==="
cat -n lib/api/jina.ts

echo ""
echo "=== Reading app/api/article/route.ts ==="
cat -n app/api/article/route.ts

Repository: mrmps/SMRY

Length of output: 32659


🏁 Script executed:

#!/bin/bash
# Check for Content Security Policy or other security headers
echo "=== Checking for CSP or security headers in Next.js config ==="
find . -type f \( -name 'next.config.*' -o -name 'middleware.ts' \) | head -10 | xargs cat 2>/dev/null

echo ""
echo "=== Checking package.json for marked and readability versions ==="
grep -A 2 '"marked"\|"readability"\|"dompurify"' package.json

Repository: mrmps/SMRY

Length of output: 699


Add HTML sanitization before rendering data.article.content with dangerouslySetInnerHTML.

Article content from all sources (Jina.ai, Readability, Diffbot) is currently rendered without sanitization. The marked markdown parser, Mozilla Readability, and Diffbot all output unsanitized HTML, and the Jina fallback's escapeHtml() only encodes entities without removing dangerous tags/attributes. Without DOMPurify or similar, the application is vulnerable to XSS attacks. Integrate a sanitization library (e.g., DOMPurify) in the article processing pipeline before the content reaches this component.

The dir/lang additions are correct and improve accessibility.

🧰 Tools
🪛 ast-grep (0.40.0)

[warning] 300-300: Usage of dangerouslySetInnerHTML detected. This bypasses React's built-in XSS protection. Always sanitize HTML content using libraries like DOMPurify before injecting it into the DOM to prevent XSS attacks.
Context: dangerouslySetInnerHTML
Note: [CWE-79] Improper Neutralization of Input During Web Page Generation [REFERENCES]
- https://reactjs.org/docs/dom-elements.html#dangerouslysetinnerhtml
- https://cwe.mitre.org/data/definitions/79.html

(react-unsafe-html-injection)

🪛 Biome (2.1.2)

[error] 301-301: Avoid passing content using the dangerouslySetInnerHTML prop.

Setting content using code can expose users to cross-site scripting (XSS) attacks

(lint/security/noDangerouslySetInnerHtml)

🤖 Prompt for AI Agents
In components/article/content.tsx around lines 297–302, the component currently
injects data.article.content directly via dangerouslySetInnerHTML, which is
vulnerable to XSS because content from Jina/Readability/Diffbot is unsanitized;
integrate HTML sanitization (e.g., DOMPurify) in the article processing pipeline
or just before rendering: install/import a sanitizer, run data.article.content
through the sanitizer to remove dangerous tags/attributes and then pass the
sanitized HTML to dangerouslySetInnerHTML (or store sanitized content on the
article object earlier), ensuring the sanitized string is what the component
renders while leaving dir/lang handling unchanged.

Comment on lines +4 to 6
// Valid protocol: http:// or https://
const VALID_PROTOCOL_REGEX = /^https?:\/\//;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "url.ts" -o -name "url.test.ts" | head -20

Repository: mrmps/SMRY

Length of output: 106


🏁 Script executed:

cat -n lib/validation/url.ts

Repository: mrmps/SMRY

Length of output: 3593


🏁 Script executed:

cat -n lib/validation/url.test.ts

Repository: mrmps/SMRY

Length of output: 10400


🏁 Script executed:

# Check if validator.js documentation or source is available in node_modules
find . -type d -name "validator" 2>/dev/null | head -5

Repository: mrmps/SMRY

Length of output: 36


🏁 Script executed:

# Look for validator package info
fd -t f "package.json" | xargs grep -l "validator" | head -3

Repository: mrmps/SMRY

Length of output: 66


🏁 Script executed:

cat package.json | grep -A 5 "validator"

Repository: mrmps/SMRY

Length of output: 410


Add case-insensitive flag to protocol regex to handle uppercase schemes

The VALID_PROTOCOL_REGEX is case-sensitive while the cleanProtocol function uses case-insensitive regexes, creating an inconsistency. Uppercase schemes like "HTTPS://example.com" will fail the check at line 60, causing normalizeUrl to incorrectly prepend https://, resulting in "https://HTTPS://example.com" that fails validation.

Add the /i flag to match the case-insensitive handling in cleanProtocol:

-const VALID_PROTOCOL_REGEX = /^https?:\/\//;
+const VALID_PROTOCOL_REGEX = /^https?:\/\//i;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 // Valid protocol: http:// or https://
-const VALID_PROTOCOL_REGEX = /^https?:\/\//;
+const VALID_PROTOCOL_REGEX = /^https?:\/\//i;
🤖 Prompt for AI Agents
In lib/validation/url.ts around lines 4 to 6, the VALID_PROTOCOL_REGEX is
currently case-sensitive and will not match uppercase schemes (e.g.,
"HTTPS://"), causing normalizeUrl to prepend another protocol; update the regex
to be case-insensitive by adding the /i flag (e.g., /^https?:\/\//i) so it
matches "http"/"https" in any case and keep the existing comment or adjust it if
needed.
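
The failure mode is easy to reproduce in isolation. In the sketch below, the two regular expressions are the ones from the suggested change, while ensureProtocol is a hypothetical stand-in for the "prepend https:// when no valid protocol is found" step in normalizeUrl.

```typescript
const CASE_SENSITIVE = /^https?:\/\//;
const CASE_INSENSITIVE = /^https?:\/\//i;

// Hypothetical stand-in for the prepend step; not the real normalizeUrl.
function ensureProtocol(url: string, validProtocol: RegExp): string {
  return validProtocol.test(url) ? url : `https://${url}`;
}

// The case-sensitive check does not recognize the uppercase scheme,
// so a second protocol is prepended.
const broken = ensureProtocol("HTTPS://example.com", CASE_SENSITIVE);

// With the /i flag the existing scheme is recognized and left in place.
const fixed = ensureProtocol("HTTPS://example.com", CASE_INSENSITIVE);
```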
