37 changes: 8 additions & 29 deletions README.md
@@ -1,35 +1,14 @@
# trulia-scraper
Scraper for real estate listings on [Trulia.com](https://www.trulia.com/) implemented in Python with Scrapy.
# Update trulia-scraper
For details, please refer to [khpeek](https://github.com/khpeek/trulia-scraper).

## Basic usage
To run the scraper, you need to install [Python 3](https://www.python.org/download/releases/3.0/), as well as the [Scrapy](https://pypi.python.org/pypi/Scrapy) framework and the [Pyparsing](https://pypi.python.org/pypi/pyparsing/2.2.0) module (an example install command follows the list below). The scraper features two spiders:

1. `trulia`, which scrapes all real estate listings that are _for sale_ in a given state and city, starting from a URL such as [https://www.trulia.com/CA/San_Francisco/](https://www.trulia.com/CA/San_Francisco/);
2. `trulia_sold`, which similarly scrapes listings of recently _sold_ properties, starting from a URL such as [https://www.trulia.com/sold/San_Francisco,CA/](https://www.trulia.com/sold/San_Francisco,CA/).
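Both packages are available from PyPI, so one way to install them is:

```
pip install scrapy pyparsing
```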

To run the `trulia_sold` spider for the state of `CA` and city of `San_Francisco` (the default location), simply run the command

```
scrapy crawl trulia_sold
```
from the project directory. To scrape listings for another city, specify the `city` and `state` arguments using the `-a` flag. For example,

```
scrapy crawl trulia_sold -a state=NY -a city=New_York
```
will scrape all listings reachable from [https://www.trulia.com/sold/New_York,NY/](https://www.trulia.com/sold/New_York,NY/).
To export the scraped listings directly to CSV, run `scrapy crawl trulia -o data.csv`.

By default, the scraped data will be stored (using Scrapy's [feed export](https://doc.scrapy.org/en/latest/topics/feed-exports.html)) in the `data` directory as a [JSON lines](http://jsonlines.org/) (`.jl`) file following the naming convention
## To do
1. Update the `trulia_sold` spider, which similarly scrapes listings of recently _sold_ properties starting from a URL such as [https://www.trulia.com/sold/San_Francisco,CA/](https://www.trulia.com/sold/San_Francisco,CA/).

```
data_{sold|for_sale}_{state}_{city}_{time}.jl
```

where `{sold|for_sale}` is `for_sale` or `sold` for the `trulia` and `trulia_sold` spiders, respectively, `{state}` and `{city}` are the specified state and city (e.g. `CA` and `San_Francisco`, respectively), and `{time}` represents the current UTC time.
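As a reference sketch (not necessarily how this project configures its feeds), the same naming convention can be expressed with Scrapy's feed-export URI parameters in the `FEEDS` setting (Scrapy 2.1+):

```
# settings.py -- a sketch only, assuming Scrapy 2.1+ and spiders whose
# `state` and `city` attributes are set by the -a command-line flags.
# %(time)s is a built-in feed URI parameter; the sold/for_sale part
# would differ per spider.
FEEDS = {
    'data/data_sold_%(state)s_%(city)s_%(time)s.jl': {
        'format': 'jsonlines',
    },
}
```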

If you prefer a different output file name and format, you can specify this from the command line using Scrapy's `-o` option. For example,

```
scrapy crawl trulia_sold -a state=WA -a city=Seattle -o data_Seattle.csv
```
will output the data in CSV format as `data_Seattle.csv`. (Scrapy automatically picks up the file format from the specified file extension).
# Selenium + BeautifulSoup4 for Redfin
## Basic usage
`python redfin_run.py`
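The script itself is not shown here; below is a minimal sketch of the Selenium + BeautifulSoup pattern it presumably follows. The URL and the CSS selector are illustrative assumptions, not taken from `redfin_run.py`:

```
# A sketch of the Selenium + BeautifulSoup pattern; the URL and the
# .HomeCardContainer selector are illustrative assumptions.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs a matching chromedriver on PATH
try:
    # Redfin renders listings with JavaScript, hence a real browser
    # rather than plain HTTP requests.
    driver.get('https://www.redfin.com/city/17151/CA/San-Francisco')
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for card in soup.select('.HomeCardContainer'):  # hypothetical selector
        print(card.get_text(' ', strip=True))
finally:
    driver.quit()
```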
43 changes: 43 additions & 0 deletions csv_to_json.py
@@ -0,0 +1,43 @@
# -*- coding: utf-8 -*-
# @Time : 2019-10-31 3:31 PM
# @Author : RenMeng

import ast
import json
import os
import re

import pandas as pd

data = pd.read_csv('data.csv')
data = data.fillna('')
nrows = data.shape[0]

# Columns whose cells hold Python-literal strings (lists of tuples).
tars = ['overview', 'property_taxes', 'new_homes',
        'price_history', 'similar_homes', 'new_listing', 'comparable_sales']

os.makedirs('./result/trulia', exist_ok=True)

for j in range(nrows):
    webdata = data.iloc[j].to_dict()

    for key in webdata:
        if key == 'local_commons':
            # Keep only plain text: letters, digits, basic punctuation,
            # spaces and newlines.
            webdata[key] = re.sub(r'[^0-9A-Za-z!?.,:"\' \n]', '', webdata[key])
        if key in tars and webdata[key] != '':
            try:
                # Safer drop-in for the original eval(): parse the cell
                # back into a Python literal.
                webdata[key] = ast.literal_eval(webdata[key])
            except (ValueError, SyntaxError):
                # Cells containing datetime.datetime(...) reprs are not
                # valid literals; recover them with regexes instead.
                v = re.findall(r'\((datetime.*?[^0-9])\)', webdata[key])
                new_v = []
                for _ele in v:
                    new_ele = re.sub(r'\)|datetime\.datetime\(', '', _ele).split(',')
                    # Join year-month-day into a date string and keep the
                    # last two fields (label and numeric value).
                    new_v.append(['-'.join(i.strip() for i in new_ele[:3]),
                                  new_ele[-2],
                                  ast.literal_eval(new_ele[-1].strip())])
                webdata[key] = new_v
        elif isinstance(webdata[key], str):
            # Split plain multi-line cells into lists of lines.
            lines = webdata[key].split('\n')
            if len(lines) > 1:
                webdata[key] = lines

        if key == 'comparable_sales':
            # Each row of comparable sales is a list of [field, value]
            # pairs; turn each into a dict.
            webdata[key] = [{pair[0]: pair[1] for pair in line}
                            for line in webdata[key]]

    with open('./result/trulia/trulia_output_{:d}.json'.format(j),
              'w', encoding='utf-8') as f:
        f.write(json.dumps(webdata, indent=4, ensure_ascii=False))
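Assuming `data.csv` was produced by the scraper (for example via `scrapy crawl trulia -o data.csv`) and sits next to the script, running `python csv_to_json.py` writes one `trulia_output_<row>.json` file per CSV row into `./result/trulia/`.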
Empty file added proxies.txt