
djanssen/rssriver

 
 


RSS River for Elasticsearch

Welcome to the RSS River Plugin for Elasticsearch

Versions

RSS River Plugin        | Elasticsearch
0.3.0-SNAPSHOT (master) | 0.90.4
0.1.0                   | 0.90.0-0.90.3
0.0.6                   | 0.19
0.0.5                   | 0.18
0.0.4                   | 0.18
0.0.3                   | 0.18
0.0.2                   | 0.17

Build Status

Thanks to CloudBees for the build status.

Getting Started

Installation

Run:

$ bin/plugin -install fr.pilato.elasticsearch.river/rssriver/0.1.0

The plugin manager then downloads and installs the plugin:

-> Installing fr.pilato.elasticsearch.river/rssriver/0.1.0...
Trying http://download.elasticsearch.org/fr.pilato.elasticsearch.river/rssriver/rssriver-0.1.0.zip...
Trying http://search.maven.org/remotecontent?filepath=fr/pilato/elasticsearch/rssriver/rssriver/0.1.0/fsriver-0.1.0.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/fr/pilato/elasticsearch/river/rssriver/0.1.0/rssriver-0.1.0.zip...
Downloading ......DONE
Installed rssriver

Creating an RSS river

First, create an index to store all the feed documents:

$ curl -XPUT 'localhost:9200/lemonde/' -d '{}'

Then create the river with the following properties:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
      "name": "lemonde",
      "url": "http://www.lemonde.fr/rss/une.xml"
    }
    ]
  }
}'

This feed follows the RSS 2.0 specification and provides a ttl entry, so the update rate is automatically adjusted to that value.

To set your own refresh rate when the feed provides no ttl, use the update_rate option (in milliseconds). To force that rate even when the feed does provide a ttl, also set ignore_ttl to true:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
      "name": "lemonde",
      "url": "http://www.lemonde.fr/rss/une.xml",
      "update_rate": 900000,
      "ignore_ttl": true
    }
    ]
  }
}'
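
The interaction between ttl, update_rate and ignore_ttl described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the 15-minute fallback is invented for the example, and the plugin's actual logic may differ. RSS 2.0 expresses ttl in minutes, whereas update_rate is given in milliseconds.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: picks the delay between two fetches of one feed. */
final class UpdateRateSketch {
    // Assumed fallback for this example only; not taken from the plugin.
    static final long FALLBACK_MS = TimeUnit.MINUTES.toMillis(15);

    /**
     * @param ttlMinutes   value of the feed's <ttl> element, or null if absent
     * @param updateRateMs configured update_rate, or null if not set
     * @param ignoreTtl    configured ignore_ttl flag
     */
    static long effectiveDelayMs(Integer ttlMinutes, Long updateRateMs, boolean ignoreTtl) {
        // Honor the feed's ttl unless the river was told to ignore it.
        if (!ignoreTtl && ttlMinutes != null && ttlMinutes > 0) {
            return TimeUnit.MINUTES.toMillis(ttlMinutes);
        }
        // Otherwise use the configured rate, or the assumed fallback.
        return updateRateMs != null ? updateRateMs : FALLBACK_MS;
    }
}
```

With the settings from the example above (update_rate of 900000 and ignore_ttl set to true), this sketch would return 900000 regardless of the ttl the feed advertises.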

If you need to index multiple feeds, list them all in the feeds array of a single river:

$ curl -XPUT 'localhost:9200/actus/' -d '{}'

$ curl -XPUT 'localhost:9200/_river/actus/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
      "name": "lemonde",
      "url": "http://www.lemonde.fr/rss/une.xml",
      "update_rate": 900000
    }, {
      "name": "lefigaro",
      "url": "http://rss.lefigaro.fr/lefigaro/laune",
      "update_rate": 1800000,
      "ignore_ttl": true
    }
    ]
  }
}'

Working with mappings

When you create your index, you can specify the mapping you want to use as follows:

$ curl -XPUT 'http://localhost:9200/lefigaro/' -d '{}'

$ curl -XPUT 'http://localhost:9200/lefigaro/page/_mapping' -d '{
    "page" : {
        "properties" : {
            "feedname" : {"type" : "string"},
            "title" : {"type" : "string", "analyzer" : "french"},
            "description" : {"type" : "string", "analyzer" : "french"},
            "author" : {"type" : "string"},
            "link" : {"type" : "string"}
        }
    }
}'

The feed documents will then use this mapping once you create the river:

$ curl -XPUT 'localhost:9200/_river/lefigaro/_meta' -d '{
  "type": "rss",
  "rss": {
    "feeds" : [ {
      "url": "http://rss.lefigaro.fr/lefigaro/laune"
    }
    ]
  }
}'

HTTP proxy

An HTTP proxy can be defined with proxyhost and proxyport:

$ curl -XPUT 'localhost:9200/_river/lemonde/_meta' -d '{
  "type": "rss",
  "rss": {
    "proxyhost" : "proxyserver.domain",
    "proxyport" : 3128,
    "feeds" : [ {
      "name": "lemonde",
      "url": "http://www.lemonde.fr/rss/une.xml"
    }
    ]
  }
}'
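
For readers curious how such settings could translate into code, here is a minimal Java sketch that turns proxyhost and proxyport into a java.net.Proxy. The class and method names are hypothetical and this is not necessarily how the plugin wires its HTTP client; it only shows one standard-library way to do it.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

/** Hypothetical helper: maps the river's proxy settings to a java.net.Proxy. */
final class ProxySketch {
    /** Returns an HTTP proxy for the given host and port, or NO_PROXY if no host is set. */
    static Proxy fromSettings(String proxyHost, int proxyPort) {
        if (proxyHost == null || proxyHost.isEmpty()) {
            return Proxy.NO_PROXY;
        }
        // createUnresolved defers the DNS lookup until a connection is opened.
        return new Proxy(Proxy.Type.HTTP,
                InetSocketAddress.createUnresolved(proxyHost, proxyPort));
    }
}
```

The resulting object could then be passed to URL.openConnection(Proxy) when fetching a feed.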

Behind the scenes

RSS River downloads each RSS feed every update_rate milliseconds and checks whether there are new messages.

First, RSS River looks at the <channel> tag. It reads the optional <pubDate> tag and stores it in Elasticsearch to compare it on the next run.
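
The <pubDate> comparison can be sketched in Java as below. This is an illustration rather than the plugin's code: the names are hypothetical, and processing the feed whenever a date is missing is an assumption made for the example. RSS 2.0 dates follow RFC 822, which java.time's RFC_1123_DATE_TIME formatter parses for four-digit years.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: decides whether a channel changed since the last run. */
final class PubDateSketch {
    /**
     * @param channelPubDate text of the channel's <pubDate>, or null if the tag is absent
     * @param lastSeen       date stored after the previous run, or null on the first run
     */
    static boolean shouldProcess(String channelPubDate, ZonedDateTime lastSeen) {
        if (channelPubDate == null || lastSeen == null) {
            return true; // assumption: with nothing to compare, process the feed
        }
        ZonedDateTime published = ZonedDateTime.parse(
                channelPubDate.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
        // Only a strictly newer publication date triggers a new pass over the items.
        return published.isAfter(lastSeen);
    }
}
```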

Then, for each <item> tag, RSS River creates a new document of type page with the following properties:

XML Tag              | ES Mapping
<title>              | title
<description>        | description
<author>             | author
<link>               | link
<geo:lat> <geo:long> | location

The document ID is generated from the description using the UUID generator, so each message is indexed only once. Read the RSS 2.0 Specification for more details about RSS channels.
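
To illustrate why a description-derived ID prevents duplicates, here is a minimal Java sketch. It assumes a name-based UUID computed with UUID.nameUUIDFromBytes; the text above only says "the UUID generator", so the exact method the plugin uses may differ, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper: derives a stable document ID from an item's description. */
final class ItemIdSketch {
    static String idFor(String description) {
        // Name-based (version 3) UUID: the same bytes always give the same ID,
        // so re-indexing an unchanged item overwrites the existing document.
        return UUID.nameUUIDFromBytes(
                description.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

One consequence of this design is that two items with identical descriptions would share an ID, which is presumably why the to-do list below mentions using <guid> instead.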

To Do List

There is still a lot to do:

  • As <pubDate> tag is optional, we have to check if RSS River is working in that case and parse each feed message
  • Support more RSS <channel> sub-elements, such as <category>, <skipDays>, <skipHours>
  • Support more RSS <item> sub-elements, such as <category>, <enclosure>, <pubDate>
  • Support for multi-channel (one per language for instance)
  • Use <guid> as the text to encode to generate ID
