some-programs · dsteinkopf · Oct 2, 2022 · Aug 22, 2024
diff --git a/.gitignore b/.gitignore
@@ -8,3 +8,4 @@ build/
 include/
 .vagrant/
 .DS_Store
+venv/
diff --git a/.vscode/launch.json b/.vscode/launch.json
@@ -0,0 +1,16 @@
+{
+  // Use IntelliSense to learn about possible attributes.
+  // Hover to view descriptions of existing attributes.
+  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
+  "version": "0.2.0",
+  "configurations": [
+    {
+      "name": "Python Debugger: Dirks",
+      "type": "debugpy",
+      "request": "launch",
+      "program": "${file}",
+      "console": "integratedTerminal",
+      "args": "-v wordpress-xml/nerdblog.wp.xml"
+    }
+  ]
+}
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,81 @@
+# ExitWP for Hugo
+
+## Convert WordPress and Squarespace exports to the [Hugo static site generator](https://gohugo.io/)
+
+This is an updated version of the ExitWP tool, originally created by Thomas Frössman for Jekyll and later adapted for Hugo by Arjan Wooning.
+
+For a detailed guide and background information, visit [Arjan Wooning's website](https://arjan.wooning.cz/conversion-tools-from-wordpress-to-hugo/#final-solution-exitwp-for-hugo).
+
+ExitWP is a tool designed to simplify the migration process from one or more WordPress blogs, or other blogs/websites exported to the WordPress XML format, to the [Hugo static site generator](https://gohugo.io/). It aims to convert as much information as possible from the WordPress export, with options to filter the converted data.
+[Squarespace](https://squarespace.com/) also offers the option to [export your site as WordPress-formatted XML file(s)](https://support.squarespace.com/hc/en-us/articles/206566687-Exporting-your-site?platform=v6&websiteId=5974c4a71b631b9a769048c6).
+
+## Features
+
+- Converts WordPress export XML to Hugo-compatible Markdown or HTML
+- Downloads and processes images within posts
+- Supports inclusion of comments from WordPress posts
+- Handles tags and categories for Hugo
+- Flexible configuration options via `config.yaml`
+
+Please refer to the [release notes](RELEASE_NOTES.md) for an overview of changes and updates.
+
+## Getting Started
+
+1. Clone the repository: `git clone https://github.com/wooni005/exitwp-for-hugo.git`
+2. Export your WordPress blog(s) using the WordPress exporter (Tools > Export in WordPress admin). Other website hosts, such as [Squarespace](https://squarespace.com/), also offer the option to export your site as WordPress-formatted XML file(s).
+3. Place all WordPress XML files in the `wordpress-xml` directory
+4. Configure the tool by editing `config.yaml`
+5. Run the converter: `python3 exitwp.py`
+6. Optionally, if the script runs into issues or the output does not look correct, run `xmllint` (part of [Libxml2](https://en.wikipedia.org/wiki/Libxml2)) on your export file(s) and fix any errors it reports.
+7. Your converted blog(s) will be in separate directories under the build directory set by `build_dir` in `config.yaml` (`build` in the shipped configuration).
+
+## Dependencies
+
+- Python 3.x
+- markdownify
+- PyYAML
+- Beautiful Soup 4
+
+## Installing Dependencies
+
+```bash
+pip3 install -r requirements.txt
+```
+
+## Configuration
+
+Refer to the `config.yaml` file for all configurable options. Key settings include:
+
+- `wp_exports`: Directory containing WordPress export XML files
+- `build_dir`: Target directory for output
+- `download_images`: Whether to download and relocate images
+- `include_comments`: Option to include comments in the exported content
+- `target_format`: Choose between 'markdown' or 'html' output
+- `image_settings`: Configure image processing behavior
+
+## Usage
+
+Basic usage:
+
+```bash
+python3 exitwp.py
+```
+
+For verbose output:
+
+```bash
+python3 exitwp.py -v
+```
+
+## Known Issues and Limitations
+
+- Potential issues with non-UTF-8 encoded WordPress dump files
+- Image downloading may fail for some URLs (for example, because of 404 errors or timeouts)
+
+## Support
+
+This tool is not actively maintained. For support or custom modifications, consider using AI chatbots like ChatGPT or Claude.
+
+## Contributing
+
+If you've made significant improvements to the tool, feel free to submit a pull request.
diff --git a/README.rst b/README.rst
diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -0,0 +1,60 @@
+# Changelog and Release Notes
+
+# August 2024
+
+## exitwp.py
+
+### Major Changes
+- Replaced html2text_file with markdownify for HTML to Markdown conversion
+- Added support for downloading and processing images within posts
+- Implemented comment extraction and inclusion in the output
+- Added support for tags and categories handling
+- Improved error handling and logging
+
+### New Features
+- Image processing: Downloads images, saves them locally, and updates image URLs in the content
+- Comment handling: Extracts and includes comments in the output markdown files
+- Tags and categories: Properly handles WordPress tags and categories, mapping them to Hugo format
+- Timezone handling: Added support for the CET timezone
+
+### Improvements
+- Enhanced YAML header generation for Hugo compatibility
+- Improved date parsing and handling
+- Better error logging and verbose output options
+- Refactored code for better readability and maintainability
+
+### Bug Fixes
+- Fixed issues with Unicode handling
+- Addressed potential errors in parsing XML and HTML content
+
+## config.yaml
+
+### New Options
+- Added `tags_label` option to specify the label for tags/categories in the output
+- Introduced `include_comments` option to control whether comments are included in the export
+
+### Changes
+- Refined `taxonomies` configuration to better handle tags and categories
+- Updated `body_replace` patterns for improved content transformation
+
+### Improvements
+- Added more detailed comments and explanations for configuration options
+
+## Overall Improvements
+
+1. Better Hugo Compatibility: The updated script now generates output more closely aligned with Hugo's expectations.
+2. Enhanced Image Handling: Improved downloading and processing of images within posts.
+3. Comment Support: Added the ability to include WordPress comments in the exported content.
+4. Improved Taxonomy Handling: Better management of tags and categories for Hugo.
+5. More Flexible Configuration: Additional options in config.yaml for finer control over the export process.
+
+## Upgrade Notes
+
+When upgrading to this new version:
+
+1. Review the new configuration options in config.yaml and adjust as needed for your use case.
+2. Be aware of the change from html2text to markdownify for HTML to Markdown conversion.
+3. Test the script with a small subset of your content first to ensure compatibility with your specific WordPress export.
+4. Pay attention to the new image handling and comment inclusion features, adjusting settings as necessary.
+
+This update significantly improves the WordPress to Hugo migration process, offering more features and better compatibility with Hugo's content structure.
diff --git a/Vagrantfile b/Vagrantfile
diff --git a/config.yaml b/config.yaml
@@ -1,19 +1,71 @@
+# Enable verbose output. Can also be set with the -v command-line argument.
+verbose: False
+
 # The directory where exitwp looks for wordpress export xml files.
 wp_exports: wordpress-xml
 
 # The target directory where all output is saved.
 build_dir: build
 
 # Output format: primary choices are html or markdown.
+# Some features, such as comment inclusion, only produce Markdown output
+# and may not look as expected in HTML.
 target_format: markdown
 
 # The date format of the wikipedia export file.
-# I'm not sure if this ever differs depending on wordpress localization.
-# Wordpress is often so full of strange quirks so I wouldnt rule it out.
+# I'm not sure if this ever differs depending on WordPress localization.
+# WordPress is so full of strange quirks that I wouldn't rule it out.
 date_format: '%Y-%m-%d %H:%M:%S'
 
-# Try to download and reloacate all images locally to the blog.
-download_images: False
+# Try to download and relocate all images locally to the blog.
+download_images: True
+
+# Image URL filtering
+image_settings:
+  # URL parts to exclude when processing images
+  excluded_url_parts:
+    - 'tracking.pixel.com'
+    - 'http://www.assoc-amazon.com/'
+  # Domains to always include when processing images
+  included_domains:
+    - 'nerdblog.steinkopf.net'
+  # Default behavior for image validity when no other conditions are met
+  # Set to true to include images by default, false to exclude by default
+  #
+  # If set to true:
+  #   - All images will be considered valid unless explicitly excluded
+  #   - The 'included_domains' setting will have no effect
+  #
+  # If set to false:
+  #   - Only images from 'included_domains' will be considered valid
+  #   - All other images will be excluded unless explicitly included
+  #   - This can be handy if you want to process only images from your old
+  #     blog for example, but not download images from the public internet
+  #     to your own (new) server.
+  #
+  # Examples:
+  # 1. To process all images except those from specific domains:
+  #    default_image_validity: true
+  #    excluded_url_parts:
+  #      - 'ads.example.com'
+  #      - 'tracking.example.com'
+  #
+  # 2. To process only images from specific domains:
+  #    default_image_validity: false
+  #    included_domains:
+  #      - 'images.mysite.com'
+  #      - 'cdn.mysite.com'
+  #
+  default_image_validity: false
+  # Icon to use when an image is not found. Make sure to put this file in
+  # the right place on your destination server manually.
+  # (This file is not supplied with exitwp, you have to pick one yourself.)
+  not_found_icon: '/icons/question-warning.svg'
+  # Default timeout (in seconds) for image downloads
+  download_timeout: 10
+
+# Include old/existing comments with the post
+include_comments: true
 
 # Item types we don't want to import.
 item_type_filter: {attachment, nav_menu_item}
@@ -22,21 +74,29 @@ item_type_filter: {attachment, nav_menu_item}
 # By default, we're filtering based on field "status" set to "draft"
 item_field_filter: {status: draft}
 
+# Output label for categories or tags.
+# NOTE: This overrides the name_mapping in the taxonomies below!
+# If not defined here, the default is tags_label: 'categories', as set in
+# the exitwp.py script.
+# tags_label: 'tags'
+
 taxonomies:
   # Filter taxonomies.
   filter: {}
   # Filter taxonomies entries.
   entry_filter: {category: Uncategorized}
   # Rename taxonomies when writing jekyll output format.
+  # NOTE: The categories label is overridden by tags_label above!
   name_mapping: {category: categories, post_tag: tags}
 
 # Replace certain patterns in body
 # Simply replace the key with its value
 body_replace: {
+  # '\(/media/': '(/images/posts/',
   # '<pre.*?lang="(.*?)".*?>': '\n{% codeblock \1 lang:\1 %}\n',
   # '<pre.*?>': '\n{% codeblock %}\n',
   # '</pre>': '\n{% endcodeblock %}\n',
 
-#    '[python]': '{% codeblock lang:python %}',
-#    '[/python]': '{% endcodeblock %}',
-}
+  #    '[python]': '{% codeblock lang:python %}',
+  #    '[/python]': '{% endcodeblock %}',
+  }