This script processes JSON files exported from the Serper.dev Playground: it filters out specific TLDs and blacklisted domains/strings, removes duplicates, and writes the combined data out in several formats.
I made it for friends who want to use the Playground and get data quickly, without having to boot up the API.
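For reference, a Playground export follows the shape of a Serper search response. The snippet below is a minimal sketch of that shape; the exact fields (`organic`, `title`, `link`, `snippet`) are assumptions based on the Serper API, so check against your own export:

```json
{
  "searchParameters": { "q": "example query" },
  "organic": [
    {
      "title": "Example Domain",
      "link": "https://example.com/",
      "snippet": "Example snippet text.",
      "position": 1
    }
  ]
}
```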
- Python 3
- pandas

To get set up:

- Clone the repository:

  ```bash
  git clone https://github.com/garetharnold/serparse.git
  cd serparse
  ```

- Install the required packages:

  ```bash
  pip3 install pandas
  ```

- (OPTIONAL) Download the `tlds.json` file from the following URL and place it in the same directory as `serparse.py`:

- You can create your own `tlds.json` file for TLD filtering:

  ```json
  {
    "blocked_tlds": [
      { "tld": ".xyz", "description": "Commonly associated with spam and low-quality emails." },
      { "tld": ".top", "description": "Commonly used by spammers." },
      { "tld": ".win", "description": "Frequently used for spam and fraudulent activities." },
      { "tld": ".vip", "description": "Known for a high volume of spam." },
      { "tld": ".click", "description": "Often used in phishing and spam emails." }
    ]
  }
  ```
- You can create your own `blacklist.json` file for domain or string filtering:

  ```json
  {
    "blocked_domains": [
      "amazon.com",
      "ebay.com",
      "example.com",
      "string",
      "news",
      "wikipedia"
    ]
  }
  ```

To run the script:
```bash
python3 serparse.py -i filename1.json filename2.json
```
Replace `filename1.json` and `filename2.json` with your actual JSON files. By default, the output will be in JSON format.
Output formats:

- Default: JSON
- CSV: `-o csv`
- URL CSV: `-o urls`
```bash
python3 serparse.py -i input1.json input2.json -o csv
```
This will process the input files, filter out unwanted TLDs and domains/strings, remove duplicates, and save the output in CSV format with a timestamped filename.
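The exact naming scheme lives in `serparse.py`; as a hedged sketch, a timestamped output filename can be built like this (the `serparse_` prefix and the timestamp format are assumptions, not taken from the script):

```python
from datetime import datetime

# Assumed naming scheme: serparse_YYYYMMDD_HHMMSS.csv
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_filename = f"serparse_{timestamp}.csv"
print(output_filename)  # e.g. serparse_20240101_120000.csv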
- Input Files: The script accepts multiple JSON input files specified with the `-i` or `--input` option.
- Output Format: The output format can be specified with the `-o` or `--output` option. Supported formats are `csv`, `urls`, and `json` (default).
- Filtering: The script filters out entries based on TLDs specified in `tlds.json` and domains/strings specified in `blacklist.json` (see the sketch after this list).
- Duplicate Removal: Duplicates are removed based on the domain.
- Logging: The script generates a log file that records which entries were removed due to duplicates or filtering.
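As a rough sketch of how this filter-dedupe-log pipeline can be implemented (not the actual code from `serparse.py`; the function names, the `link` field, and the log format are assumptions):

```python
import json
import logging
from urllib.parse import urlparse

logging.basicConfig(filename="serparse.log", level=logging.INFO)

def load_list(path, key):
    """Load a JSON config file; return an empty list if it is absent."""
    try:
        with open(path) as f:
            return json.load(f)[key]
    except FileNotFoundError:
        return []

# blocked_tlds entries are objects with a "tld" key; blocked_domains are plain strings
blocked_tlds = [entry["tld"] for entry in load_list("tlds.json", "blocked_tlds")]
blocked_strings = load_list("blacklist.json", "blocked_domains")

def filter_and_dedupe(results):
    """Drop blocked TLDs/strings and duplicate domains, logging each removal."""
    seen_domains = set()
    kept = []
    for item in results:
        domain = urlparse(item["link"]).netloc  # assumes each entry has a "link" URL
        if any(domain.endswith(tld) for tld in blocked_tlds):
            logging.info("removed (blocked TLD): %s", domain)
        elif any(s in domain for s in blocked_strings):
            logging.info("removed (blacklisted): %s", domain)
        elif domain in seen_domains:
            logging.info("removed (duplicate): %s", domain)
        else:
            seen_domains.add(domain)
            kept.append(item)
    return kept
```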