37 changes: 8 additions & 29 deletions README.md
@@ -1,35 +1,14 @@
# trulia-scraper
Scraper for real estate listings on [Trulia.com](https://www.trulia.com/) implemented in Python with Scrapy.
# Update trulia-scraper
For details, please refer to [khpeek](https://github.com/khpeek/trulia-scraper).

## Basic usage
To run the scraper, you need to install [Python 3](https://www.python.org/download/releases/3.0/), as well as the [Scrapy](https://pypi.python.org/pypi/Scrapy) framework and the [Pyparsing](https://pypi.python.org/pypi/pyparsing/2.2.0) module (an example install command follows the list below). The scraper features two spiders:

1. `trulia`, which scrapes all real estate listings that are _for sale_ in a given state and city, starting from a URL such as [https://www.trulia.com/CA/San_Francisco/](https://www.trulia.com/CA/San_Francisco/);
2. `trulia_sold`, which similarly scrapes listings of recently _sold_ properties, starting from a URL such as [https://www.trulia.com/sold/San_Francisco,CA/](https://www.trulia.com/sold/San_Francisco,CA/).
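Both packages are available from PyPI, so one way to install them is:

```
pip install scrapy pyparsing
```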

To run the `trulia_sold` spider for the state of `CA` and city of `San_Francisco` (the default location), simply run the command

```
scrapy crawl trulia_sold
```
from the project directory. To scrape listings for another city, specify the `city` and `state` arguments using the `-a` flag. For example,

```
scrapy crawl trulia_sold -a state=NY -a city=New_York
```
will scrape all listings reachable from [https://www.trulia.com/sold/New_York,NY/](https://www.trulia.com/sold/New_York,NY/).
To export the scraped listings directly to CSV, run `scrapy crawl trulia -o data.csv`.

By default, the scraped data will be stored (using Scrapy's [feed export](https://doc.scrapy.org/en/latest/topics/feed-exports.html)) in the `data` directory as a [JSON lines](http://jsonlines.org/) (`.jl`) file following the naming convention
## To do
1. Update the `trulia_sold` spider, which similarly scrapes listings of recently _sold_ properties starting from a URL such as [https://www.trulia.com/sold/San_Francisco,CA/](https://www.trulia.com/sold/San_Francisco,CA/).

```
data_{sold|for_sale}_{state}_{city}_{time}.jl
```

where `{sold|for_sale}` is `for_sale` or `sold` for the `trulia` and `trulia_sold` spiders, respectively, `{state}` and `{city}` are the specified state and city (e.g. `CA` and `San_Francisco`, respectively), and `{time}` represents the current UTC time.
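As a reference sketch (not necessarily how this project configures its feeds), the same naming convention can be expressed with Scrapy's feed-export URI parameters in the `FEEDS` setting (Scrapy 2.1+):

```
# settings.py -- a sketch only, assuming Scrapy 2.1+ and spiders whose
# `state` and `city` attributes are set by the -a command-line flags.
# %(time)s is a built-in feed URI parameter; the sold/for_sale part
# would differ per spider.
FEEDS = {
    'data/data_sold_%(state)s_%(city)s_%(time)s.jl': {
        'format': 'jsonlines',
    },
}
```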

If you prefer a different output file name and format, you can specify this from the command line using Scrapy's `-o` option. For example,

```
scrapy crawl trulia_sold -a state=WA -a city=Seattle -o data_Seattle.csv
```
will output the data in CSV format as `data_Seattle.csv`. (Scrapy automatically picks up the file format from the specified file extension).
# Selenium + BeautifulSoup4 for Redfin
## Basic usage
`python redfin_run.py`
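The script itself is not shown here; below is a minimal sketch of the Selenium + BeautifulSoup pattern it presumably follows. The URL and the CSS selector are illustrative assumptions, not taken from `redfin_run.py`:

```
# A sketch of the Selenium + BeautifulSoup pattern; the URL and the
# .HomeCardContainer selector are illustrative assumptions.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs a matching chromedriver on PATH
try:
    # Redfin renders listings with JavaScript, hence a real browser
    # rather than plain HTTP requests.
    driver.get('https://www.redfin.com/city/17151/CA/San-Francisco')
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for card in soup.select('.HomeCardContainer'):  # hypothetical selector
        print(card.get_text(' ', strip=True))
finally:
    driver.quit()
```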
43 changes: 43 additions & 0 deletions csv_to_json.py
@@ -0,0 +1,43 @@
# -*- coding: utf-8 -*-
# @Time : 2019-10-31 3:31 PM
# @Author : RenMeng

import ast
import json
import os
import re

import pandas as pd

data = pd.read_csv('data.csv')
data = data.fillna('')
nrows = data.shape[0]

# Columns whose cells hold Python-literal strings (lists of tuples).
tars = ['overview', 'property_taxes', 'new_homes',
        'price_history', 'similar_homes', 'new_listing', 'comparable_sales']

os.makedirs('./result/trulia', exist_ok=True)

for j in range(nrows):
    webdata = data.iloc[j].to_dict()

    for key in webdata:
        if key == 'local_commons':
            # Keep only plain text: letters, digits, basic punctuation,
            # spaces and newlines.
            webdata[key] = re.sub(r'[^0-9A-Za-z!?.,:"\' \n]', '', webdata[key])
        if key in tars and webdata[key] != '':
            try:
                # Safer drop-in for the original eval(): parse the cell
                # back into a Python literal.
                webdata[key] = ast.literal_eval(webdata[key])
            except (ValueError, SyntaxError):
                # Cells containing datetime.datetime(...) reprs are not
                # valid literals; recover them with regexes instead.
                v = re.findall(r'\((datetime.*?[^0-9])\)', webdata[key])
                new_v = []
                for _ele in v:
                    new_ele = re.sub(r'\)|datetime\.datetime\(', '', _ele).split(',')
                    # Join year-month-day into a date string and keep the
                    # last two fields (label and numeric value).
                    new_v.append(['-'.join(i.strip() for i in new_ele[:3]),
                                  new_ele[-2],
                                  ast.literal_eval(new_ele[-1].strip())])
                webdata[key] = new_v
        elif isinstance(webdata[key], str):
            # Split plain multi-line cells into lists of lines.
            lines = webdata[key].split('\n')
            if len(lines) > 1:
                webdata[key] = lines

        if key == 'comparable_sales':
            # Each row of comparable sales is a list of [field, value]
            # pairs; turn each into a dict.
            webdata[key] = [{pair[0]: pair[1] for pair in line}
                            for line in webdata[key]]

    with open('./result/trulia/trulia_output_{:d}.json'.format(j),
              'w', encoding='utf-8') as f:
        f.write(json.dumps(webdata, indent=4, ensure_ascii=False))
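Assuming `data.csv` was produced by the scraper (for example via `scrapy crawl trulia -o data.csv`) and sits next to the script, running `python csv_to_json.py` writes one `trulia_output_<row>.json` file per CSV row into `./result/trulia/`.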
Empty file added proxies.txt