Scraper for real estate listings on Trulia.com implemented in Python with Scrapy.
To run the scraper, you need to install Python 3, as well as the Scrapy framework and the pyparsing module. The scraper features two spiders:
- trulia, which scrapes all real estate listings for sale in a given state and city, starting from a URL such as https://www.trulia.com/CA/San_Francisco/;
- trulia_sold, which similarly scrapes listings of recently sold properties, starting from a URL such as https://www.trulia.com/sold/San_Francisco,CA/.
To run the trulia_sold spider for the state of CA and city of San_Francisco (the default locale), simply run the command
scrapy crawl trulia_sold
from the project directory. To scrape listings for another city, specify the city and state arguments using the -a flag. For example,
scrapy crawl trulia_sold -a state=NY -a city=New_York
will scrape all listings reachable from https://www.trulia.com/sold/New_York,NY/.
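Under the hood, Scrapy passes each -a name=value pair to the spider's constructor as a keyword argument. The following minimal sketch shows a spider of this shape; the CSS selectors and item fields are purely illustrative placeholders, not the project's actual parsing logic:

    import scrapy


    class TruliaSoldSpider(scrapy.Spider):
        name = 'trulia_sold'

        def __init__(self, state='CA', city='San_Francisco', *args, **kwargs):
            super().__init__(*args, **kwargs)
            # -a state=... and -a city=... arrive here as keyword arguments
            self.state = state
            self.city = city
            self.start_urls = [f'https://www.trulia.com/sold/{city},{state}/']

        def parse(self, response):
            # Hypothetical selector: follow each listing link on the index page
            for href in response.css('a.listing::attr(href)').getall():
                yield response.follow(href, callback=self.parse_listing)

        def parse_listing(self, response):
            # Hypothetical fields; the real spider extracts far more detail
            yield {
                'url': response.url,
                'address': response.css('h1::text').get(),
            }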
By default, the scraped data will be stored (using Scrapy's feed export) in the data directory as a JSON lines (.jl) file following the naming convention
data_{sold|for_sale}_{state}_{city}_{time}.jl
where {sold|for_sale} is for_sale or sold for the trulia and trulia_sold spiders, respectively, {state} and {city} are the specified state and city (e.g. CA and San_Francisco, respectively), and {time} represents the current UTC time.
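This naming convention can be expressed with Scrapy's feed-export URI parameters: %(time)s is filled in by Scrapy itself, and any other %(...)s placeholder is resolved from the spider attribute of the same name. With a recent Scrapy release (2.1+), a sketch of such a configuration in settings.py might look as follows; the label attribute (holding sold or for_sale) is an assumption, and the project's actual settings may differ:

    # settings.py (sketch): %(time)s is supplied by Scrapy; %(label)s,
    # %(state)s and %(city)s must exist as attributes on the spider.
    FEEDS = {
        'data/data_%(label)s_%(state)s_%(city)s_%(time)s.jl': {
            'format': 'jsonlines',
        },
    }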
If you prefer a different output file name and format, you can specify this from the command line using Scrapy's -o option. For example,
scrapy crawl trulia_sold -a state=WA -a city=Seattle -o data_Seattle.csv
will output the data in CSV format as data_Seattle.csv. (Scrapy automatically infers the output format from the specified file extension.)
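Because the default feed is JSON lines, each line of the output file is a standalone JSON object, so the data is easy to load back into Python. A short sketch (the file name below is only an example; substitute your own feed file):

    import json

    # Each line of a .jl feed is one JSON object (one scraped listing).
    with open('data/data_sold_CA_San_Francisco_2017-01-27T14-02-03.jl') as f:
        listings = [json.loads(line) for line in f]

    print(f'Loaded {len(listings)} listings')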