
Conversation

@divyanshutomar

Article on running background tasks in Python using task queues

@divyanshutomar divyanshutomar changed the title from "added post content and images" to "background tasks in Python using task queues" on Aug 1, 2018
Contributor
@sidhanthp sidhanthp left a comment

I really enjoyed the style of the post and think you did a great job with a starter application, but I don't think the reader is left with a strong understanding of Redis Queues, how to use them, and what's happening internally.

I left you a couple of comments; take a look at them and let me know what you think.

---
# Running background tasks in Python using task queues

We often come across problems in our applications where a compute-intensive time-taking task needs to be performed on the server in response to some user activity or request. For a server-side application exposing a REST API, handling this problem is different from common CRUD endpoints where request-response lifecycle is usually short. In this case, the response for such a request may not be available, as it may be not viable to proceed with its execution immediately on the same process.
Contributor

Rather than starting with "We often come across problems...", it would be more powerful to give an example where this occurs.

Contributor

Additionally, I think this example is a little convoluted for the 1st paragraph.

For a server-side application exposing a REST API, handling this problem is different from common CRUD endpoints where request-response lifecycle is usually short. In this case, the response for such a request may not be available, as it may be not viable to proceed with its execution immediately on the same process.
-->
When dealing with server-side applications, the response from a REST API may not be available immediately. It would be more efficient to run background tasks than waste the idle CPU cycles.

Author

It makes sense, I will rework the introduction.



The execution of such tasks or jobs can be performed in the background by some another process spawned for the sole purpose. These processes are usually called _workers_. They run concurrently with the main process (web server in our case) handling client requests. To list of all the tasks which need to be executed, a job/task queue is maintained to store tasks along with its metadata created by incoming requests on the web server. The worker process then executes these tasks chronologically. This modular approach makes it easier for the web server to accommodate execution of such long-running tasks as it will not get blocked itself in doing so. This also means that the web server can respond to forthcoming client requests.
Contributor

I assume you meant to write "by another process".

Contributor

"To list all the tasks"

Author

Thanks for pointing out the grammatical mistakes. 👍
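
The producer/worker split described above can be sketched with a plain Redis list before any framework enters the picture. This is an illustrative sketch only, not the RQ-based approach the article uses; the queue name and task payload are invented for the example:

```python
import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Web server side: push a task description onto a list (the "queue").
def enqueue_task(task_name, payload):
    r.rpush('tasks', json.dumps({'name': task_name, 'payload': payload}))

# Worker side: block until a task arrives, then execute it.
def worker_loop():
    while True:
        _, raw = r.blpop('tasks')  # FIFO, so tasks run chronologically
        task = json.loads(raw)
        print('executing', task['name'], 'with', task['payload'])
```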


![architecture of web server and queue](./images/running-background-tasks-python/small-archi.png)

Task queues are quite popular among microservices architecture. They enable each microservice to perform its definite task really well and take care of the complexities of inter-microservice communication.
Contributor

I'm not sure what you mean when you say 'definite task'. If I were to guess, you're referring to the idea that each microservice should perform a small task.

Additionally, if you make a claim that a task queue takes 'care of the complexities of inter-microservice communication', I would explain it a little further (unless I'm missing something).

Author

I think it's not about a small or big task. I feel 'dedicated task' will make more sense here?

I will add more details here explaining the role of messaging queues in microservices architecture.

![health check](./images/running-background-tasks-python/health-check.png)


### Getting to Know the Starter Application
Contributor

I would use this to make the tool names stand out, rather than this

Author

Noted


### Writing the parser

Let's start with writing a simple parser that accepts a Goodreads book page URL. We will be using the requests Python library for making an HTTP request to get the HTML content of the page. BeautifulSoup is a Python library that lets us search, manipulate, and create structured markup languages such as HTML, XML, etc. It will create a searchable tree from the fetched page's HTML. This will allow us to retrieve key information like the book title, author, rating, and description.
Contributor

I would leave out "It will create a searchable tree from the fetched page's HTML."
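
As a rough sketch of the parser this section describes (the element selectors here are assumptions; the real Goodreads markup would need to be inspected):

```python
import requests
from bs4 import BeautifulSoup

def parse_book_link_for_meta_data(bookUrl):
    html = requests.get(bookUrl).text
    soup = BeautifulSoup(html, 'html.parser')  # searchable tree of the page
    # 'bookTitle' and 'authorName' are placeholder selectors.
    title = soup.find(id='bookTitle')
    author = soup.find(class_='authorName')
    return dict(
        title=title.get_text(strip=True) if title else '',
        author=author.get_text(strip=True) if author else '',
    )
```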


### Inspecting task queue

The creators of the Redis Queue (RQ) library have developed another library for checking the state of a Redis queue. It is called *rq-dashboard* and it can be integrated with our Flask web application: it exposes a Flask blueprint for integrating with an existing Flask project. It's a browser-based application which shows the status of queues, the workers listening on those queues, and the queued jobs along with their meta information. It also provides triggers for flushing the queue and re-queuing failed jobs.
Contributor

I think you can make this a little more concise and give a brief description of the benefits.

Contributor

It looks like you've written about it a little below, I think it might be better to move that up.
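
For reference, embedding *rq-dashboard* in a Flask app follows roughly this pattern (the `/rq` URL prefix is an arbitrary choice):

```python
import rq_dashboard
from flask import Flask

app = Flask(__name__)
app.config.from_object(rq_dashboard.default_settings)
app.register_blueprint(rq_dashboard.blueprint, url_prefix='/rq')
```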


Now, we are all set to begin testing our application with some Goodreads URLs. Let's start by making a POST request to the `/parseGoodReads` endpoint. Make sure to provide a valid list of URLs in an array as the request body.
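
Any HTTP client will do; for example, with the requests library (assuming the Flask dev server on its default port, and an illustrative book URL):

```python
import requests

resp = requests.post(
    'http://localhost:5000/parseGoodReads',
    json=['https://www.goodreads.com/book/show/4671.The_Great_Gatsby'],
)
print(resp.status_code, resp.json())
```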

![post request](./images/running-background-tasks-python/post-req.png)
Contributor

What are you using to make the post request?

## Conclusion and Takeaways

The above application demonstrates how queuing frameworks like Redis Queue can be leveraged for solving problems that need more time and computing resources. In our application, if the parsing task is executed on the same process where client requests are being processed and served, it can easily become a performance bottleneck with high traffic. The above approach not only helps in avoiding that but also brings modularity to the table. Following are some key takeaways you can apply to tackle similar problems:
* Queueing frameworks allows more granular control over scaling of different processes. More worker processes can be spawned if there is an accumulation of a large number of tasks in the queue.
Contributor

"frameworks allow more"


* Multiple queues can be used for handling different types of tasks.
Contributor

Did you talk about this earlier in the post?

* Every task can send some meta information about its status or progress so far to Redis. This information can be useful for getting an insight into a task that runs for a long duration.

Thanks for following along and I hope this post would have been useful for you.
Contributor

I don't think this line adds anything.

@sidhanthp sidhanthp self-assigned this Aug 1, 2018
@divyanshutomar
Author

@sidhanthp I have made the required changes. Kindly review and let me know if anything else needs to be worked upon.

Contributor

@sidhanthp sidhanthp left a comment


Great job with the post, I've left you with a few comments that I think make the article a little easier to read and digest.

I found it difficult to follow the longer paragraphs. Some of the paragraphs are big blocks of text, which I think causes your eyes to glaze over the text. If you break up the paragraphs, or make the takeaways obvious using italics, I think it would make the article a little easier to follow.

---
# Running background tasks in Python using task queues

Let's say we have an e-commerce website where users can place orders for various products. Now there's a business requirement of finding out the kind of orders being placed and the most in-demand product in real-time. Also, for every order placed, the buyer should receive a confirmation of the order via an email or a notification through a messaging service. In this case, if we process the order information on the same REST API service handling product requests, it may lead to significant problems. The API service may not be able to respond in a short time as it can be blocked on external services like the messaging service. This synchronous model worsens with a high number of orders being placed in a short span of time as the service may be preoccupied processing previous requests. Thus, compute-intensive, time-taking tasks like processing of orders on an e-commerce platform require an asynchronous approach.
Contributor

I think this is too wordy.

Before writing, I would think through what you want the reader to take away from the paragraph. As the first paragraph, you introduce the reader to the e-commerce website, but they've clicked on the article to learn about 'background tasks in Python' and I think it would be more effective starting with that. You only touch about the importance of background tasks at the end of the paragraph.

Imagine I'm a reader who doesn't know what background tasks are / why I would want them, how would you introduce me with as little background?


Task or message queues are quite popular in microservices architectures. They enable each microservice to perform its dedicated task and work as a medium for inter-microservice communication. These queues store messages or data incoming from _producer_ microservices, which can be processed or consumed by _consumer_ microservices. In the e-commerce example above, the REST API handling orders is a producer microservice which pushes these orders to the queue, whereas a data analysis microservice determining the kind of orders being placed, or the messaging service, can be considered a consumer microservice.

## Queueing frameworks to the rescue
Contributor

I think this section would be more effective at the bottom. First, teach the reader why they need background tasks, about RQ, then point them to the other solutions.



The execution of such tasks or jobs can be performed in the background by another process spawned for the sole purpose. These processes are usually called _workers_. They run concurrently with the main process (web server in our case) handling client requests. To list all the tasks which need to be executed, a job/task queue is maintained to store tasks along with their metadata created by incoming requests on the web server. The worker process then executes these tasks chronologically. This modular approach makes it easier for the web server to accommodate execution of such long-running tasks as it will not get blocked itself in doing so. This also means that the web server can respond to forthcoming client requests.
Contributor

Try to make this more concise. Here's an example -->

Workers can be used to execute these tasks in the background. They run concurrently in the background using a queue, and the worker executes the tasks chronologically.

This modular approach prevents the web server from being blocked from responding to incoming client requests.


## Real World Application

We will be writing a Flask-based web application which retrieves _Goodreads_ book information like title, author, rating and description. The web server exposes an endpoint that accepts book URLs. A function will crawl and parse this URL for meta information of the book. As this function will take time to execute and may lead to blocking of the main thread, we will execute it asynchronously by pushing it to Redis queue (RQ). RQ allows us to enqueue multiple function calls to a queue which can be executed in parallel by a separate worker process. It requires a Redis server as a message broker for performing this operation. Let's get into the code and learn how we can use Redis queue in our web applications.
Contributor

Break the paragraph here. I think it makes it easier to follow.

"""
... pushing it to Redis queue (RQ).

RQ allows us to enqueue ...
"""

```python
    return dict(
        title=title.strip() if title else '',
        author=author.strip() if author else '',
        rating=float(rating.strip() if rating else 0),
        description=description,
    )
```

We can now write a function called `parse_and_persist_book_info` that calls the above parsing function and persists the value to Redis so that it can be retrieved later. This function along with its arguments will be pushed to the queue so that the worker process can execute it. Redis is a key-value store where the key should be unique, else it may lead to overwriting of a previous value. Here `generate_redis_key_for_book` is a function that generates a unique key for a given book URL.
Contributor

Introduce a paragraph break here -->

"""
... process can execute it.

Redis is a key-value ...
"""

```python
#........

# This generates a unique Redis key against a book URL
generate_redis_key_for_book = lambda bookURL: 'GOODREADS_BOOKS_INFO:' + bookURL
```
Contributor

I would explain in a little more detail what you're doing here if this is a major piece of the post.
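
To spell out what the snippet does: the lambda namespaces each book URL under a common prefix, so every book gets a unique, recognizable key in Redis. For instance (the book id is illustrative):

```python
generate_redis_key_for_book = lambda bookURL: 'GOODREADS_BOOKS_INFO:' + bookURL

print(generate_redis_key_for_book('https://www.goodreads.com/book/show/4671'))
# GOODREADS_BOOKS_INFO:https://www.goodreads.com/book/show/4671
```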

```python
redisKey = generate_redis_key_for_book(bookUrl)  # get Redis key for given book URL
bookInfo = parse_book_link_for_meta_data(bookUrl)  # get book meta information from the parsing function above
# Set the value in Redis. Here pickle serializes the dictionary
redisClient.set(redisKey, pickle.dumps(bookInfo))
```
Contributor

I would explain redisClient.set(redisKey,pickle.dumps(bookInfo)) further. I could be missing it, but I don't see where you initialized redisClient.
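
For completeness: `redisClient` would have to be created before the snippet above runs. A typical initialization might look like this (host and port are the redis-py defaults):

```python
import redis

redisClient = redis.Redis(host='localhost', port=6379)
```

Reading the value back later would reverse the serialization with `pickle.loads(redisClient.get(redisKey))`.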


### The endpoint for accepting URLs

Let's set up an endpoint that will accept a list of valid Goodreads book URLs. This is going to support the POST method with URLs accepted as an array in _application/json_ body format. For validating the Goodreads book URLs, we check for unique occurrences of URLs which start with the string `https://www.goodreads.com/book/show/`. After the validation check, all valid URLs are pushed to Redis queue for parsing information. Here the method `enqueue_call` of Redis queue instance takes in a function that will be executed by worker process along with required arguments of the function.
Contributor

I would take out the first couple sentences, and make the paragraph this -->
After the validation check, all valid URLs are pushed to Redis queue for parsing information. Here the method enqueue_call of Redis queue instance takes in a function that will be executed by worker process along with required arguments of the function.

I don't think the first couple sentences add to the explanation of Redis Queues.
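
For context, a sketch of how such an endpoint could be wired up (the queue setup and response shape are assumptions; `parse_and_persist_book_info` is the function defined earlier in the post):

```python
from flask import Flask, jsonify, request
from redis import Redis
from rq import Queue

from tasks import parse_and_persist_book_info  # assumed module layout

app = Flask(__name__)
queue = Queue(connection=Redis())

GOODREADS_BOOK_PREFIX = 'https://www.goodreads.com/book/show/'

@app.route('/parseGoodReads', methods=['POST'])
def parse_goodreads():
    urls = request.get_json()
    valid_urls = {u for u in urls if u.startswith(GOODREADS_BOOK_PREFIX)}
    for url in valid_urls:
        # The worker process, not this request handler, runs the function.
        queue.enqueue_call(func=parse_and_persist_book_info, args=(url,))
    return jsonify(queued=len(valid_urls))
```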

@divyanshutomar
Author

@sidhanthp Updated

You moved the wrong paragraph to the bottom, I fixed it.
Updated structure to 'move up' information dedicated to Redis Queues and removed information that was specific to the starter application.
@divyanshutomar
Author

@sidhanthp Great job with the final touches. I guess we decided to have an example of a background job in the introductory para; that's why I kept the e-commerce example. Nevertheless, if you want to keep it concise, the current state looks good to me.
