Skip to content

pioneer/8coupons

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

8 coupons demo

How to use

Install MongoDB.

Install requirements:

$ pip install -r requirements.txt

Run HTTP API server:

$ make runserver

Run scraper in a separate console:

$ make runscraper

API endpoint:

http://localhost:8080/games?search=<search string>

Requirements

Python 3.5+, MongoDB and the libraries below (see requirements.txt for more details):

  • aiohttp
  • motor
  • requests
  • snowballstemmer
  • BeautifulSoup

Main components

  • HTTP API server
  • MongoDB storage
  • Scraping and indexing background worker

Configuring

Most of the values worth to configure are extracted into settings.py. Feel free to set them up according to your project environment.

How it works

The scraper runs periodically in background and grabs the fresh games data from GiantBomb API. The games added since last run are added to the database, along with updating the search index for each newly appeared game.

The index is built using Snowball stemmer, by extracting stems from each field which is configured in settings to be under indexing. The index is a set of records like {<stem>: [a list of game ids containing this stem]}. Searching against the index is searching for given stems, extracting game ids linked to each stem and returning those games.

Example: we have configured in settings.py that the fields of interest are "name" and "description". Then, two new games come:

{"id": 1,
 "name": "Arkanoid: New Horizons",
 "description": "The reincarnation of well-known Arkanoid game.",
 "image": "http://static.giantbomb.com/images/1.jpg",
 "date_added": "2016-05-30 15:16:51",
 "tags": "arcade,arkanoid,breakout,old,platformer,new"}

{"id": 2,
 "name": "Pacman 3D",
 "description": "Our old friend Pacman appears in 3D with brand new graphics, monsters and labyrinths.",
 "image": "http://static.giantbomb.com/images/2.jpg",
 "date_added": "2016-05-30 15:25:13",
 "tags": "arcade,pacman,3d,old,platformer,new"}

The indexer parses "name" and "description" of both games, omitting all other fields. So, the words like "platformer" and "arcade" from tags will not come into the index.

First game gives "arkanoid", "new", "horizon" stems extracted from name, and "the", "reincarn", "of", "well", "known", "arkanoid", "game" stems extracted from description.

Second game produces "pacman", "3d" and "our", "old", "friend", "pacman", "appear", "in", "3d", "with", "brand", "new", "graphics", "monster", "and", "labyrinth" respectively. Notice the "new" stem appears in both games.

After updating the index, assuming it was initially empty, the index will look like below:

{"stem": "arkanoid", "game_ids": [1]}
{"stem": "new", "game_ids": [1, 2]}
{"stem": "horizon", "game_ids": [1]}
{"stem": "the", "game_ids": [1]}
{"stem": "reincarn", "game_ids": [1]}
{"stem": "of", "game_ids": [1]}
{"stem": "well", "game_ids": [1]}
{"stem": "known", "game_ids": [1]}
{"stem": "game", "game_ids": [1]}
{"stem": "pacman", "game_ids": [2]}
{"stem": "3d", "game_ids": [2]}
{"stem": "our", "game_ids": [2]}
{"stem": "old", "game_ids": [2]}
{"stem": "friend", "game_ids": [2]}
{"stem": "appear", "game_ids": [2]}
{"stem": "in", "game_ids": [2]}
{"stem": "with", "game_ids": [2]}
{"stem": "brand", "game_ids": [2]}
{"stem": "graphics", "game_ids": [2]}
{"stem": "monster", "game_ids": [2]}
{"stem": "and", "game_ids": [2]}
{"stem": "labyrinth", "game_ids": [2]}

Then, if a search appears by using search string "new game":

http://localhost:8080/games?search=new+game

the HTTP API server will look for "new" and "game" stems in the index, which contains [1, 2] for "new" and [1] for "game", so thus resulting in both games in output:

{"games":
    {"id": 1,
     "name": "Arkanoid: New Horizons",
     "description": "The reincarnation of well-known Arkanoid game.",
     "image": "http://static.giantbomb.com/images/1.jpg",
     "date_added": "2016-05-30 15:16:51",
     "tags": "arcade,arkanoid,breakout,old,platformer,new"},
    {"id": 2,
     "name": "Pacman 3D",
     "description": "Our old friend Pacman appears in 3D with brand new graphics, monsters and labyrinths.",
     "image": "http://static.giantbomb.com/images/2.jpg",
     "date_added": "2016-05-30 15:25:13",
     "tags": "arcade,pacman,3d,old,platformer,new"}}

The search against, for example, "monster" will return just the second game, and the search against "horizon" will return only the first one.

Notes

There are both asynchronous and synchronous database usages. Asynchronous is faster and used in HTTP API server, synchronous is used in scraper, due to currently there is just a single thread scraping games data, and adding async stuff along with event loop would add unneeded complexity. This can be changed when a multithreaded scraping approach will be implemented.

Scaling approach

  • nginx as a load balancer
  • several machines with identical HTTP API server and MongoDB contents
  • one of the machines above will serve as a master node, the only node running a scraper, along with propagating the changes across all Mongo databases
  • propagating of the changes, depending on limitations we have, may be implemented using whether MongoDB built-in replication features (master-slave, replicasets), or manually using Apache Kafka or RabbitMQ (we need to improve the scraper in this case in order to send messages and also implement another background process on slave nodes to listen to changes coming from the master)

TODO

  • tests
  • scaling (see Scaling approach)
  • caching the index and some frequently used games data
  • spelling errors processing via Levenstein algorithm
  • working correctly with non-English texts
  • remove words like "the", "of", "with", "in" from using in index
  • more flexible scraping in order to allow extending with another sources
  • multithreaded scraping and totally switching to async db access
  • limit and offset parameters for splitting the results
  • MongoDB indexes
  • bonus points

About

A test task for 8coupons

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published