RSS River for Elasticsearch

Welcome to the RSS River Plugin for Elasticsearch

Versions

RSS River Plugin	ElasticSearch
0.3.0-SNAPSHOT (master)	0.90.4
0.1.0	0.90.0-0.90.3
0.0.6	0.19
0.0.5	0.18
0.0.4	0.18
0.0.3	0.18
0.0.2	0.17

Build Status

Thanks to cloudbees for the build status :

Getting Started

Installation

Just type :

$ bin/plugin -install fr.pilato.elasticsearch.river/rssriver/0.1.0

This will do the job...

-> Installing fr.pilato.elasticsearch.river/rssriver/0.1.0...
Trying http://download.elasticsearch.org/fr.pilato.elasticsearch.river/rssriver/rssriver-0.1.0.zip...
Trying http://search.maven.org/remotecontent?filepath=fr/pilato/elasticsearch/rssriver/rssriver/0.1.0/fsriver-0.1.0.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/fr/pilato/elasticsearch/river/rssriver/0.1.0/rssriver-0.1.0.zip...
Downloading ......DONE
Installed rssriver

Creating a RSS river

We create first an index to store all the feed documents :

$ curl -XPUT 'localhost:9200/lemonde/' -d '{}'

We create the river with the following properties :

Feed URL : http://www.lemonde.fr/rss/une.xml

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
    	"name": "lemonde",
    	"url": "http://www.lemonde.fr/rss/une.xml"
    	}
    ]
  }
}'

This RSS feed follows RSS 2.0 specifications and provide a ttl entry. The update rate will be auto-adjusted following this value.

If you want to set your own refresh rate (if not provided) and force it (even if it's provided), use update_rate and ignore_ttl options:

We create the river with the following properties :

Feed URL: http://www.lemonde.fr/rss/une.xml
Update Rate: every 15 minutes (15 * 60 * 1000 = 900000 ms)
Ignore TTL : true

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
    	"name": "lemonde",
    	"url": "http://www.lemonde.fr/rss/une.xml",
    	"update_rate": 900000,
    	"ignore_ttl": true
    	}
    ]
  }
}'

If you need to get multiple feeds, you can add them :

Feed1

URL : http://www.lemonde.fr/rss/une.xml
Update Rate1 : every 15 minutes (15 * 60 * 1000 = 900000 ms) (will be modified by provided TTL)

Feed2

URL : http://rss.lefigaro.fr/lefigaro/laune
Update Rate2 : every 30 minutes (30 * 60 * 1000 = 1800000 ms)
Ignore TTL : true

$ curl -XPUT 'localhost:9200/actus/' -d '{}'

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
			"name": "lemonde",
			"url": "http://www.lemonde.fr/rss/une.xml",
			"update_rate": 900000
    	}, {
			"name": "lefigaro",
			"url": "http://rss.lefigaro.fr/lefigaro/laune",
			"update_rate": 1800000,
			"ignore_ttl": true
    	}
    ]
  }
}'

Working with mappings

When you create your index, you can specify the mapping you want to use as follow :

$ curl -XPUT 'http://localhost:9200/lefigaro/' -d '{}'

$ curl -XPUT 'http://localhost:9200/lefigaro/page/_mapping' -d '{
    "page" : {
        "properties" : {
            "feedname" : {"type" : "string"},
            "title" : {"type" : "string", "analyzer" : "french"},
            "description" : {"type" : "string", "analyzer" : "french"},
            "author" : {"type" : "string"},
            "link" : {"type" : "string"}
        }
    }
}'

Then, your feed will use it when you create the river :

$ curl -XPUT 'localhost:9200/_river/lefigaro/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
		    "url": "http://rss.lefigaro.fr/lefigaro/laune"
	    }
    ]
  }
}'

HTTP proxy

A HTTP proxy can be defined with proxyhost and proxyport :

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "proxyhost" : "proxyserver.domain",
    "proxyport" : 3128,
    "feeds" : [ {
    	"name": "lemonde",
    	"url": "http://www.lemonde.fr/rss/une.xml"
    	}
    ]
  }
}'

Behind the scene

RSS river downloads RSS feed every update_rate milliseconds and check if there is new messages.

At first, RSS river look at the <channel> tag. It reads the optional <pubDate> tag and store it in Elastic Search to compare it on next launch.

Then, for each <item> tag, RSS river creates a new document within page type with the following properties :

XML Tag	ES Mapping
<title>	title
<description>	description
<author>	author
<link>	link
<geo:lat> <geo:long>	location

ID is generated from description using the UUID generator. So, each message is indexed only once. Read RSS 2.0 Specification for more details about RSS channels.

To Do List

Many many things to do :

As <pubDate> tag is optional, we have to check if RSS River is working in that case and parse each feed message
Support more RSS <channel> sub-elements, such as <category>, <skipDays>, <skipHours>
Support more RSS <item> sub-elements, such as <category>, <enclosure>, <pubDate>
Support for multi-channel (one per language for instance)
Use <guid> as the text to encode to generate ID

Name		Name	Last commit message	Last commit date
Latest commit History 128 Commits
_site		_site
src		src
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

RSS River for Elasticsearch

Versions

Build Status

Getting Started

Installation

Creating a RSS river

Working with mappings

HTTP proxy

Behind the scene

To Do List

About

Uh oh!

Releases

Packages

Languages

djanssen/rssriver

Folders and files

Latest commit

History

Repository files navigation

RSS River for Elasticsearch

Versions

Build Status

Getting Started

Installation

Creating a RSS river

Working with mappings

HTTP proxy

Behind the scene

To Do List

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages